Web Data Scraping

Bandana Vishwakarma
Jul 3, 2020

What is it?

If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale.

Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.

More than a modern convenience, the true power of web scraping lies in its ability to build and power some of the world’s most revolutionary business applications.

“Transformative” doesn’t even begin to describe the way some companies use web-scraped data to enhance their operations, informing everything from executive decisions down to individual customer service experiences.

Scraping a web page involves two steps: fetching it and extracting data from it. Fetching is the downloading of a page, which is what a browser does when you view the page.

Web crawling is therefore a main component of web scraping: it fetches pages for later processing. Once a page is fetched, extraction can take place. The content of the page may be parsed, searched, or reformatted, its data copied into a spreadsheet, and so on.
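To make the fetch-then-extract split concrete, here is a minimal sketch using only Python's standard library. The function names `fetch` and `extract_links`, the `example-scraper/0.1` User-Agent string, and the sample markup are all assumptions made for illustration; a real scraper would adapt the extraction to the page it targets.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.request import Request, urlopen


def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetching: download a page and return its HTML as text."""
    # The User-Agent value is a made-up example; some servers reject the default.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Extraction: collect (link text, href) pairs for every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html: str) -> List[Tuple[str, str]]:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Extraction demonstrated on an inline sample, so no network access is needed.
sample = '<p><a href="https://example.com/a">Acme</a> and <a href="/b">Beta Corp</a></p>'
links = extract_links(sample)
print(links)  # [('Acme', 'https://example.com/a'), ('Beta Corp', '/b')]
```

In practice the two steps chain together as `extract_links(fetch(url))`, with the URL being whichever page you have chosen to scrape.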

Web scrapers typically take something out of a page to make use of it for another purpose somewhere else. An example would be finding and copying names and phone numbers, or companies and their URLs, into a list.
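As a rough illustration of the names-and-phone-numbers example, the sketch below pulls pairs out of already-extracted page text with a regular expression. The `Name: 555-123-4567` layout and the pattern itself are assumptions; real pages vary widely and usually need a pattern tailored to them.

```python
import re

# Assumed layout: a capitalised full name, a colon, then a NNN-NNN-NNNN number.
CONTACT_RE = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)+):\s*(\d{3}-\d{3}-\d{4})")


def extract_contacts(text):
    """Return a list of (name, phone) tuples found in the text."""
    return CONTACT_RE.findall(text)


# Made-up sample text standing in for the visible text of a page.
page_text = """
Jane Doe: 555-123-4567
John Smith: 555-987-6543
Office hours: 9-5
"""

contacts = extract_contacts(page_text)
print(contacts)  # [('Jane Doe', '555-123-4567'), ('John Smith', '555-987-6543')]
```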

Why Scraping?

Data mining for ML models

Sentiment Analysis

For valuable insights

Service provider

For your own data analysis

High accuracy of data extraction

Speed

Price monitoring

Stock market tracking

News and content monitoring

Market research

Steps to scrape

1. Choose the website to scrape.

2. Analyse the HTML structure of the website (that is how the data is presented on the web).

3. Select the data you need to extract.

4. Write a script to crawl the site and extract the data.

5. Store the results in a file or database.

6. If the data changes regularly, run the script again.
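The steps above can be sketched end to end as follows. This is only an outline under stated assumptions: `SAMPLE_HTML` stands in for a downloaded page, the table layout and the `name`/`price` columns are invented for the example, and the standard library's `html.parser` and `csv` modules are used so the sketch runs without extra installs.

```python
import csv
import io
from html.parser import HTMLParser

# Steps 1-2: a stand-in for the chosen page; real markup will differ.
SAMPLE_HTML = """
<table id="products">
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Steps 3-4: collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def scrape_rows(html):
    parser = RowParser()
    parser.feed(html)
    return parser.rows


def save_csv(rows, out, header=("name", "price")):
    """Step 5: write the rows to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


rows = scrape_rows(SAMPLE_HTML)
buffer = io.StringIO()  # swap for open("products.csv", "w", newline="") to write a file
save_csv(rows, buffer)
print(buffer.getvalue())
```

For step 6, the same script can simply be run again on whatever schedule suits the data; how it is scheduled is left open here.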

Problems in Scraping

1. Complicated web page structures

2. Changeable web structures

3. reCAPTCHA

4. Bot blocking

5. IP blocking

6. Dynamic content

7. Login access required

8. Honeypots
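As one illustration from the list above, honeypots are often links that a human visitor never sees but a naive crawler follows. The sketch below is a simple heuristic, not a complete defence: it assumes the trap links are hidden with an inline `style` or the `hidden` attribute, and it will miss links hidden through CSS classes or external stylesheets. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


def looks_hidden(attrs):
    """Heuristic: the `hidden` attribute is set, or an inline style hides the element."""
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return (
        "hidden" in attrs
        or "display:none" in style
        or "visibility:hidden" in style
    )


class VisibleLinkParser(HTMLParser):
    """Separate links a visitor could see from links that look like traps."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.skipped = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if looks_hidden(attrs):
            self.skipped.append(href)
        else:
            self.visible.append(href)


# Made-up markup: one normal link and two hidden ones.
sample = (
    '<a href="/products">Products</a>'
    '<a href="/trap" style="display: none">x</a>'
    '<a href="/trap2" hidden>y</a>'
)

parser = VisibleLinkParser()
parser.feed(sample)
print(parser.visible)  # ['/products']
print(parser.skipped)  # ['/trap', '/trap2']
```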

Libraries to use

1. Scrapy

2. Python requests

3. Beautiful Soup 4

4. lxml

5. urllib

6. Selenium

7. MechanicalSoup
