What is it?
If you’ve ever copied and pasted information from a website, you’ve performed the same function as a web scraper, only on a microscopic, manual scale.
Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.
More than a modern convenience, the true power of web scraping lies in its ability to build and power some of the world’s most revolutionary business applications.
Transformative doesn’t even begin to describe the way some companies use web scraped data to enhance their operations, informing executive decisions all the way down to individual customer service experiences.
Scraping a web page involves two steps: fetching it and extracting data from it. Fetching is downloading the page, which is what a browser does when you view it.
Web crawling is therefore a main component of web scraping: it fetches pages for later processing. Once a page is fetched, extraction can take place. The content of the page may be parsed, searched, or reformatted, and its data copied into a spreadsheet, and so on.
Web scrapers typically take something out of a page to make use of it for another purpose somewhere else. An example would be finding names and phone numbers, or companies and their URLs, and copying them into a list.
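To make the fetch-then-extract split concrete, here is a minimal sketch that uses only Python's standard library. The function names, the User-Agent string, the phone-number pattern, and the sample HTML are all assumptions made for illustration. A real page would need selectors matched to its own markup.

```python
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url, timeout=10.0):
    """Fetching step: download a page and return its body as text."""
    # The User-Agent value is a placeholder chosen for this example.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ListItemParser(HTMLParser):
    """Collect the text content of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # holds text while we are inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None


# Loose, illustrative pattern: a digit, at least six digits or separators,
# then a final digit. It is not a validator for real phone numbers.
PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(html):
    """Extraction step: return (name, phone) pairs found in list items."""
    parser = ListItemParser()
    parser.feed(html)
    contacts = []
    for item in parser.items:
        match = PHONE.search(item)
        if match:
            name = item[:match.start()].strip(" :,-")
            contacts.append((name, match.group()))
    return contacts


# Made-up sample standing in for a fetched page.
SAMPLE = """
<ul>
  <li>Ada Lovelace: 555-0100-200</li>
  <li>No phone here</li>
  <li>Alan Turing, +44 20 7946 0958</li>
</ul>
"""

if __name__ == "__main__":
    for name, phone in extract_contacts(SAMPLE):
        print(name, phone)
```

In practice you would pass the result of fetch(url) to extract_contacts instead of the sample string. Keeping the two steps in separate functions means the extraction logic can be tested without any network access.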
Why Scraping?
Data mining for ML models
Sentiment Analysis
For valuable insights
Service provider
For your own analysis of data
High accuracy of data extraction
Speed
Price monitoring
Stock market tracking
News and content monitoring
Market research
Steps to scrape
1. Choose the website for scraping data
2. Analyse the HTML structure of the website (that is how the data is laid out on the page)
3. Select what you need to extract
4. Create a script to crawl the data
5. Store it in a file or database
6. If the data changes regularly, run the script again
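The steps above can be sketched end to end with Python's standard library. The example assumes the data of interest lives in the td cells of an HTML table. The sample page, the column names, and the helper names are hypothetical, and the download itself (step 1) is replaced by a hard-coded string so that the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Steps 2-3: walk the HTML and keep only <td> cells, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells end up empty
                self.rows.append(self._row)
            self._row = None


def scrape_rows(html):
    """Step 4: the 'script' that turns a page into rows of values."""
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows


def store_rows(rows, out, header):
    """Step 5: write the rows as CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


# Made-up page standing in for the chosen website (step 1).
SAMPLE_PAGE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

if __name__ == "__main__":
    buffer = io.StringIO()
    store_rows(scrape_rows(SAMPLE_PAGE), buffer, ["product", "price"])
    print(buffer.getvalue())
```

For step 6, the same script could be rerun on a schedule (for example with cron or a task scheduler), writing to a real file opened with newline="" instead of the in-memory buffer.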
Problems in Scraping
1. Complicated web page structures
2. Changing web page structures
3. reCAPTCHA
4. Bot blocking
5. IP blocking
6. Dynamic content
7. Login access required
8. Honeypots
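Bot blocking and IP blocking are often triggered by sending too many requests too quickly, so one common mitigation is to slow down and retry with increasing delays. The sketch below shows exponential backoff under stated assumptions: TemporarilyBlocked is a hypothetical exception that your own fetch function would raise when the site throttles it, and the retry count and delays are arbitrary illustrative values. It does not address CAPTCHAs, logins, or honeypots.

```python
import time


class TemporarilyBlocked(Exception):
    """Hypothetical error raised by a fetch function when it is throttled."""


def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), waiting 1x, 2x, 4x, ... base_delay between failures.

    The sleep function is injectable so the logic can be tested without
    actually waiting.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except TemporarilyBlocked:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch(url):
        # Simulated fetcher: blocked twice, then succeeds.
        calls["count"] += 1
        if calls["count"] <= 2:
            raise TemporarilyBlocked(url)
        return "<html>ok</html>"

    waits = []
    page = fetch_with_backoff(flaky_fetch, "https://example.com", sleep=waits.append)
    print(page, waits)
```

Doubling the delay after each failure keeps the scraper from hammering a site that is already refusing it. Whether a site tolerates scraping at all is a separate question that backoff does not answer.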
Libraries to use
1. Scrapy
2. Python requests
3. Beautiful Soup 4
4. lxml
5. urllib
6. Selenium
7. MechanicalSoup