Scraping is the process of using automated tools to collect large amounts of data output by an application, website or application programming interface (API). Bots are the most common scraping tool. They extract HTML code and the database-backed data that an application exposes through its pages and responses.
Data scraping is used for legitimate purposes as well as illegal or malicious ones. Digital businesses may scrape data legally, for example to auto-fetch prices for price comparison sites or to perform market research on forums and social media platforms. Malicious actors, on the other hand, may use scraping to steal copyrighted content or to undercut competitors' prices.
Legitimate scraping typically uses pre-built bots, scripts, or scraping-as-a-service providers.
Malicious parties often write their own scraping scripts. These scripts ignore the restrictions that sites place on bots and may disguise themselves as real users.
The following are the typical steps in the malicious scraping process:
There are three primary types of data scraping.
Content scraping refers to when bots scrape the content present on a website. This information can then be replicated to mirror the unique advantages of products and services that rely on site content.
Price scraping uses bots to pull pricing data. This has legitimate uses, such as comparison sites, but it can also be used to undercut competitor prices or to gain an advantage over competitors' pricing plans.
Contact scraping pulls user data, such as email addresses and phone numbers. Spammers and scammers use this information for bulk email lists, robocalls and social engineering attacks.
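To make the scraping types above more concrete, the following is a minimal sketch of how a bot might pull prices and contact details out of a page. The HTML snippet, the `price` class name, and the function names `scrape_prices` and `scrape_emails` are all assumptions made up for this illustration; real pages and real scrapers vary widely.

```python
import re
from html.parser import HTMLParser

# Hypothetical page markup, invented for this illustration.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$5.00</span></li>
</ul>
<p>Contact: sales@example.com or support@example.com</p>
"""


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element as a float."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Assumes prices live in <span class="price">, which is specific to SAMPLE_HTML.
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data.strip().lstrip("$")))


def scrape_prices(html):
    """Price scraping: return all prices found in the markup."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


# A deliberately simple email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrape_emails(html):
    """Contact scraping: return all email-like strings found in the markup."""
    return EMAIL_RE.findall(html)


print(scrape_prices(SAMPLE_HTML))
print(scrape_emails(SAMPLE_HTML))
```

The same few lines of parsing serve a comparison site or a spammer equally well, which is why the intent behind the scraping, rather than the technique, separates legitimate from malicious use.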
Crawling is used to index content. The most common example is Google using its Googlebot crawler to index website content for its search results. Legitimate crawler bots make no attempt to hide their identity when crawling sites.
Scraping specifically pulls data and stores it in other databases. Scraper bots typically hide their identity by pretending to be web browsers or users. They take more advanced actions than crawler bots, such as filling out form fields.
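The difference in how the two kinds of bot present themselves can be sketched in code. The example below shows, under stated assumptions, what a well-behaved crawler might do: announce itself in the `User-Agent` header and consult the site's `robots.txt` rules before fetching. The `robots.txt` content, the `ExampleCrawler` name, and the helper functions are hypothetical; `robots.txt` is used here as one common example of a restriction that sites publish for bots. A disguised scraper would typically send a browser-like `User-Agent` instead and skip the check.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""

# Hypothetical self-identifying user agent for a well-behaved crawler.
CRAWLER_USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot-info)"


def is_allowed(url, user_agent=CRAWLER_USER_AGENT, robots_txt=ROBOTS_TXT):
    """Return True if the robots.txt rules permit this user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url, user_agent=CRAWLER_USER_AGENT):
    """Build (but do not send) a request that openly identifies the bot."""
    if not is_allowed(url, user_agent):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://example.com/products")
print(request.get_header("User-agent"))
print(is_allowed("https://example.com/private/data"))
```

Nothing is sent over the network here; the sketch only builds the request object so the identifying header and the permission check can be inspected.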
The following steps can be used to protect against data scraping: