Web Scraping: An Introduction

Aug 10, 2021

At Ango AI, we are laser-focused on annotating data. Many of our customers provide us with data to be annotated, and often, this data is downloaded (scraped) from the Web. This article outlines the conceptual roots of web scraping and introduces some of the tools and methods closely associated with it, so that you can get started with web scraping and familiarize yourself with the terminology and methodology.

What is automated web scraping?

Web scraping can be done manually, by selecting, copying, and downloading individual items of interest from a webpage. Yet this is extremely time-consuming and, for most modern purposes, inefficient. Hence, either complete or partial automation is used. Automated web scraping is the process of acquiring data from the web using tools and frameworks that parse content over the internet. Most of this content is presented as HTML, so the items of interest can be found by sifting through the HTML code.

Before a piece of online content such as a website can be scraped, it is fetched (downloaded). This is usually done by a crawler: automated software intended to fetch pages from source links. Once a page is fetched, scraping can begin, which involves searching for, extracting, and/or storing the information of interest from the page. This turns the unstructured data on the website into a structured format that is of more use for the user's purpose.

Web scraping overview (source)

Data that is scraped is generally data that its owners have made public. The permissions regarding the use of that data vary on a case-by-case basis. Many websites state their policies for scraping and crawling software in a text file called robots.txt, which can be accessed via <URL>/robots.txt. This is Google's robots.txt, for example.

Legitimate web scraping can be done using two methods:

  1. Using appropriate headers to identify the agent sending the requests. For example, when Google crawls a webpage, it clearly identifies itself through its HTTP User-Agent header as belonging to Google (see the sketch after this list).
  2. Abiding by the rules and permissions laid out by the website in robots.txt, or via whatever other method the website uses to present its policies regarding its data.
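As a minimal, illustrative sketch of both practices, the Python snippet below checks a site's robots.txt before fetching a page with an explicit User-Agent header. It uses the standard library's urllib.robotparser together with the requests package; the bot name and URLs are placeholders, not real endpoints.

```python
import urllib.robotparser
import requests

# Hypothetical bot identifier; a real crawler would point to a page describing itself.
USER_AGENT = "ExampleScraperBot/1.0 (+https://example.com/bot-info)"

# Read the site's robots.txt policy first.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some-page"
if rp.can_fetch(USER_AGENT, url):
    # Identify ourselves explicitly in the request headers.
    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    print(response.status_code)
else:
    print("robots.txt disallows fetching this URL")
```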

Approaches

There are various techniques for scraping data from the internet, and the best one often varies on a case-by-case basis. Some of the most common approaches include:

Pattern Matching: This involves matching regular-expression-style patterns against the HTML to find the text of interest. It is similar to the Unix grep command, wherein a text document is scanned for matches to a given pattern.
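For instance, a quick, illustrative sketch of this approach pulls all link targets out of a snippet of raw HTML with a single regular expression (robust extraction would normally use a proper parser instead):

```python
import re

# Illustrative only: extract href values from raw HTML with a regular expression.
html = '<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>'
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['https://example.com/a', 'https://example.com/b']
```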

DOM Parsing: Most effective for dynamically generated web pages, this method involves parsing the DOM tree after the dynamic page has fully loaded, using software called a web driver. The tree can then be interpreted, and the information of interest extracted, using query languages such as XPath. For most cases this is more efficient than pattern matching, since traversing the DOM tree touches far fewer elements than scanning the raw text.
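As a small sketch of DOM parsing on a static snippet (using the lxml library here, which is one common choice rather than a requirement of the approach), an XPath query can pick elements straight out of the parsed tree:

```python
from lxml import html

# Parse an HTML snippet into a DOM tree and query it with XPath.
page = html.fromstring('<html><body><h1 class="title">Hello</h1></body></html>')
titles = page.xpath('//h1[@class="title"]/text()')
print(titles)  # ['Hello']
```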

Computer Vision Analysis: This is a novel technique, still being explored, that analyses the visual data (pixels) on a page after downloading it or taking a snapshot of it. This is akin to how a human would analyse a page. It relies heavily on machine learning, specifically object detection and pattern recognition approaches.

Tools

There is a plethora of tools, frameworks, and libraries available for web scraping. However, the most common choice for many organizations and individuals is Python, together with the following libraries:

Selenium: This is a Python library, best known as a testing suite that many developers and testers use to debug interactions with their web applications. It lets the programmer plan how the “bot” (the program using the Selenium library) will interact with the web application. In the simplest terms, it automates browsers with a pre-planned, programmed set of actions to be performed iteratively. The major advantage Selenium provides is the ability to load JavaScript in web pages.

Selenium uses the web driver mentioned above to navigate to a website, download the DOM, and parse through the DOM tree to get at individual elements. It can use XPath (along with a few other locator methods) to find the element of interest and extract information from it.
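As a minimal sketch of that workflow (assuming Selenium 4 and a Chrome driver installed locally; the URL and XPath are placeholders), such a bot looks roughly like this:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a browser, load the page (including its JavaScript), and query the DOM.
driver = webdriver.Chrome()
driver.get("https://example.com")

# Locate an element via XPath and read its text.
heading = driver.find_element(By.XPATH, "//h1")
print(heading.text)

driver.quit()
```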

BeautifulSoup: This is a Python library for parsing static HTML and XML documents. Although it does not render dynamic web pages the way Selenium does, it allows for similar element extraction and tree traversal. It provides convenient methods for manipulating the “soup” parsed from the HTML, and it is more beginner-friendly and comparatively less complex than Selenium.
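A minimal sketch of the BeautifulSoup workflow (paired here with the requests package to fetch the page; the URL is a placeholder) might look like this:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a static page and parse it into a navigable "soup".
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.string)          # the page title
for link in soup.find_all("a"):
    print(link.get("href"))       # each link's target URL
```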

APIs

Although virtually any kind of data can be extracted from the internet via scraping, many organizations, social media platforms, search engines, and financial institutions let users skip that hassle altogether by providing structured, refined, formatted, and filtered data through Application Programming Interfaces (APIs). These are the most efficient, legitimate, and convenient way to access data, within certain parameters and constraints of course. Consider the figure below, which shows an example interaction with the Twitter API.

An API exposes endpoints of the program's internal architecture, and may let the programmer query the underlying database directly through those endpoints. It abstracts away the internal details of the application and gives the programmer (who wants to access the data) only the details that are relevant.

Interaction with the Twitter API (source)
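For a rough sense of what working with such an API looks like, the sketch below sends an authenticated request to a hypothetical REST endpoint; the URL, parameters, and token are placeholders, not any particular provider's real API:

```python
import requests

API_TOKEN = "YOUR_TOKEN_HERE"  # placeholder credential

# Query a hypothetical REST endpoint for already-structured (JSON) data.
response = requests.get(
    "https://api.example.com/v1/search",
    params={"query": "web scraping", "limit": 10},
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
response.raise_for_status()
data = response.json()  # no HTML parsing needed; the data arrives structured
print(data)
```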

Conclusion

Depending on your personal or organizational needs, different methods of data collection may fit best. No matter what the case may be, however, having web scraping in one's data-collection arsenal will most certainly prove useful, especially in the data-hungry domain of machine learning.

Of course, once you’ve scraped the data you need, you’ll need it labeled. Ango AI provides an end-to-end, fully managed data labeling service for AI teams, for nearly all data types. With our ever-growing team of labelers and our in-house labeling platform, we provide high quality labeling for your scraped data.

Our labeling software allows our annotators to label data in a fast and efficient way. After labeling, our platform also allows for reviewers to verify that our labelers’ work is satisfactory, and that it meets and exceeds our high quality requirements. 

Once done, we export the annotations in various formats such as COCO or YOLO, depending on customer need.

To bring labeling speed to the next level, these tools will soon be supplemented by smart annotation techniques using AI assistance, drastically reducing the time of such tasks, from minutes to a matter of seconds.

Want to try it out?

Click on the link below to request a free, hassle-free and quick demo of our data labeling service.