Web Scraping Without Getting Blocked
When building a web scraper, it is worth planning for the possibility of being blocked - not all services are friendly to having their data scraped.
To limit bot traffic, developers use IP address recognition, HTTP request header checks, CAPTCHAs, and other detection methods. These can still be bypassed, but doing so requires following a few rules while scraping.
Even if the site imposes no explicit restrictions, it is worth being respectful and not harming it: follow the rules outlined in robots.txt, don't scrape data during peak hours, limit the number of requests coming from the same IP address, and set delays between them.
Scraper Settings That Help Avoid Blocks
First of all, set up the scraper the right way.
Set Request Intervals
The most common mistake when building web scrapers is using fixed intervals. Real people do not visit a site at precisely regular intervals 24 hours a day.
Therefore, randomize the pause between iterations within some range rather than using a constant value. As a rule, it is better to make it two seconds or more.
Also, don't flip through the pages too quickly; stay on each page for a while. Imitating user behavior like this reduces the risk of blocking.
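For example, here is a minimal sketch of randomized delays in Python using the `requests` library; the list of URLs is hypothetical and `example.com` stands in for the real target site:

```python
import random
import time

import requests

# Hypothetical list of pages to visit; replace with the real target URLs.
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)

    # Wait a random 2-6 seconds before the next request instead of a fixed interval,
    # so the access pattern does not repeat with machine-like regularity.
    time.sleep(random.uniform(2, 6))
```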
Set User Agent
The User-Agent header contains information about the visitor's browser and device. In other words, it is part of the data the server receives the moment a user visits, and it helps the server tell visitors apart. If a client with the same User-Agent makes too many requests, the server may ban it.
Therefore, it is worth giving the web scraper the ability to periodically switch the User-Agent header to a randomly chosen value from a list of real browser strings. This helps avoid blocking and keeps the data collection going.
To view your own User-Agent, open DevTools (F12), go to the Network tab, and inspect the headers of any request.
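A simple way to rotate User-Agents in Python might look like the sketch below; the strings in the pool are illustrative, and in practice you would keep a larger, regularly updated list:

```python
import random

import requests

# A small sample pool of real browser User-Agent strings (shortened for illustration).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url: str) -> requests.Response:
    # Pick a different User-Agent for each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

print(fetch("https://example.com").status_code)
```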
Set Additional Request Headers
However, besides the User-Agent, there are other headers that can give a scraper away. Unfortunately, web scrapers and crawlers often send sets of headers that differ from those sent by real browsers, so it is worth taking the time to adjust all of them so that the requests don't look automated.
When a real user browses a site, the "Accept", "Accept-Encoding", "Accept-Language", and "Upgrade-Insecure-Requests" headers are usually filled in as well, so don't forget about them either. An example of typical values:
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
upgrade-insecure-requests: 1
Set Referer
The Referer header tells the site which page the user came from. If you don't know what to put in this field, you can use "google.com"; any other search engine (yahoo.com, bing.com, etc.) or a social media site works as well. For example, it might look like this:
Referer: https://www.google.com/
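Putting it all together, a Python `requests` call that sends these headers (the exact values are illustrative and should mirror what your own browser sends) could look like this:

```python
import requests

# Headers mimicking those a real Chrome browser sends, including Referer.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,"
              "image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Referer": "https://www.google.com/",
}

response = requests.get("https://example.com", headers=headers)
print(response.status_code)
```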
Set Your Fingerprint Right
Whenever someone connects to a target website, their device sends a request that includes HTTP headers. These headers carry information such as the device's time zone, language, privacy settings, cookies, and more. The browser transmits them on every visit, and taken together they form a fairly distinctive fingerprint.
For example, a given combination of these parameters may occur only once among roughly 200,000 users. It is therefore worth keeping this information consistent. An alternative is to use a third-party scraping service or residential IPs. You can check your own fingerprint with one of the online fingerprinting services.
However, it is not only the browser fingerprint that has to look right - the TLS fingerprint matters too, and many sites track TLS/HTTP fingerprints. For example, most scrapers use HTTP/1.1, while most browsers use HTTP/2 whenever it is available, so requests made over HTTP/1.1 look suspicious to many sites.
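One way to avoid the HTTP/1.1 giveaway in Python is to use a client that supports HTTP/2, such as `httpx` with its `http2` extra; libraries like `curl_cffi` go further and can impersonate a browser's TLS fingerprint. A minimal `httpx` sketch (the URL and User-Agent value are placeholders):

```python
# Requires: pip install httpx[http2]
import httpx

# A browser-like User-Agent; the value is illustrative.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}

# http2=True lets the client negotiate HTTP/2 when the server supports it,
# so the request looks closer to a modern browser than a default HTTP/1.1 client.
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://example.com")
    print(response.http_version)  # e.g. "HTTP/2"
```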
Other Ways to Avoid Blocks
So, once the settings are in place, it's time to move on to the main traps and the rules to follow.
Use Headless Browser
First of all, use a headless browser whenever possible. Headless browsers imitate real user behavior, which reduces the risk of blocking, and they run without a visible window, so everything happens in the background.
They also make it possible to collect data that is loaded with JavaScript or on dynamic AJAX pages. The most common option is headless Chrome, which most scraping tools (for example, Selenium) can drive.
Because a headless browser actually renders style elements such as fonts, layouts, and colors, it is harder to recognize and distinguish from a real user.
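For instance, here is a minimal headless Chrome setup with Selenium; the target URL is a placeholder, and Selenium 4+ with a matching Chrome installation is assumed:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Run Chrome without a visible window; pages are still rendered fully,
# including JavaScript-generated content.
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    print(driver.title)
finally:
    driver.quit()
```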
Use Proxy Server
If requests keep coming from the same place at short intervals over a long period, that behavior doesn't look like a normal user's - it looks like a bot's. To keep the target website from suspecting anything, you can use a proxy server.
In simple terms, a proxy is an intermediary computer that makes a request to the server instead of the client and returns a result to the client. Thus, the destination server thinks that the request is made from a completely different place, and therefore, by a completely different user.
Proxies are both free and paid. However, it is better not to use free proxies for data scraping - they are extremely slow and unreliable. Ideally, one should use residential or mobile proxies. In addition, it is not enough to use one proxy. For scraping, it is better to create a whole proxy pool.
It is also important to pay attention to where the requests appear to come from. If the IP's location doesn't match what the site expects, it may simply block them; a local service, for example, is unlikely to be of much use to foreign visitors. So it is better to use proxies local to the target site so as not to arouse suspicion.
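A simple sketch of a rotating proxy pool with Python's `requests`, using hypothetical proxy addresses and credentials, might look like this:

```python
import random

import requests

# Hypothetical proxy pool; replace with your own residential or datacenter proxies.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch_via_proxy(url: str) -> requests.Response:
    proxy = random.choice(PROXY_POOL)
    # Route both HTTP and HTTPS traffic through the chosen proxy.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)

print(fetch_via_proxy("https://example.com").status_code)
```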
Use CAPTCHA Solving Services
When there are too many requests, the site may ask the visitor to solve a CAPTCHA to confirm that the request comes from a real user and not a bot. In this case, dedicated services can help: for a small fee, they automatically solve the CAPTCHA that the site serves.
Avoid Honeypot Traps
To catch bots, many websites use honeypot traps. A honeypot is typically a link that is present in the page's HTML source but invisible to a real visitor. When followed automatically, such links can redirect the web scraper to decoy or blank pages.
In practice, they are fairly easy to spot: such links carry "masking" CSS properties like "display: none" or "visibility: hidden", or their color is identical to the site's background.
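As a sketch, the following Python snippet uses `requests` and `BeautifulSoup` to skip links hidden with inline CSS; the URL and marker list are illustrative, and links hidden via external stylesheets would need a rendered-DOM check instead:

```python
import requests
from bs4 import BeautifulSoup

HIDDEN_MARKERS = ("display:none", "display: none", "visibility:hidden", "visibility: hidden")

def visible_links(url: str) -> list[str]:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").lower()
        # Skip links hidden with inline CSS - a common honeypot pattern.
        if any(marker in style for marker in HIDDEN_MARKERS):
            continue
        links.append(a["href"])
    return links

print(visible_links("https://example.com"))
```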
Avoid JavaScript
Scraping JavaScript-rendered content, like scraping images, is not in itself something that triggers blocking. But note that not every library can collect such data, so a web scraper capable of handling dynamic content will have more complex code and require more computing power.
Using Ready API in Web Scrapers
If the settings and rules listed above feel like too much, and the cost of proxies and CAPTCHA-solving services is too high, there is an easier route: "delegate" the interaction with the site to a third-party service.
HasData offers a REST API for scraping web pages at any scale. The service takes care of IP blocks, IP Rotations, captchas, JavaScript rendering, finding and using residential or datacenter proxies, and setting HTTP headers and custom cookies. The user sets the query and the API returns data.
Tips & Tricks for Scraping
The last things worth mentioning are when it is best to scrape websites and how reverse engineering helps with scraping. This matters not only for avoiding blocks but also for not harming the site.
Scrape During Off-peak Hours
Because crawlers move through pages faster than real users, they noticeably increase the load on the server. If scraping runs while the server is already under heavy load, the site's services slow down and pages load more slowly.
This not only hurts the experience of real visitors but also increases the time it takes to collect the data.
Therefore, collect and extract data when the site's load is minimal. It is generally recommended to run the scraper after midnight, local time for the site.
Scrape at Different Day Times
If a site sees heavy traffic every day from exactly 8:00 to 8:20 am, that starts to raise suspicion. Therefore, add some random variation to the time at which scraping starts.
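As a rough sketch, assuming the run is triggered by a daily scheduler such as cron, a random start-up delay could be added like this:

```python
import random
import time
from datetime import datetime

# Delay the start of the scraping run by a random 0-20 minutes,
# so the job does not hit the site at exactly the same time every day.
jitter_seconds = random.uniform(0, 20 * 60)
print(f"Sleeping {jitter_seconds / 60:.1f} minutes before starting (now: {datetime.now()})")
time.sleep(jitter_seconds)

# ... start the scraping run here ...
```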
Reverse Engineering for Better Scraping
Reverse engineering is a commonly used development technique. In short, it means studying a piece of software to understand how it works.
In scraper development, this approach means analyzing the site up front in order to construct the requests you will send later. The browser's developer tools, or simply DevTools (press F12), are the main aid for analyzing web pages.
Let's take a closer look at the Google SERP. Open DevTools, go to the Network tab, search for something on google.com, and examine the request that was made. To view the response, click on the request and open the Preview tab.
This data helps you understand what exactly the request returns and in what form, while the Headers tab shows what needs to be sent to reproduce the request. The main thing is to construct the requests correctly and interpret the responses correctly.
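For illustration, a rough sketch of reproducing such a request with Python's `requests` might look like the snippet below; the parameters and header values are examples copied from DevTools, and Google may still serve a CAPTCHA or block plain HTTP clients, which is exactly why the earlier settings matter:

```python
import requests

# Reproduce the request observed in DevTools: same endpoint, query parameters, and headers.
# The header values below are illustrative; copy the ones your own browser actually sent.
params = {"q": "web scraping", "hl": "en"}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://www.google.com/search", params=params, headers=headers)
print(response.status_code)
print(response.text[:500])  # Inspect the beginning of the returned HTML
```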
Reverse Engineering of Mobile Applications
The situation is similar for reverse engineering mobile applications; the difference is that here you need to intercept the requests the mobile app sends to its server. Unlike intercepting ordinary browser requests, this calls for a man-in-the-middle proxy such as Charles Proxy.
Also, keep in mind that the requests sent by mobile applications tend to be more complex and harder to untangle.
Conclusion and Takeaways
Finally, let’s take a look at what security measures sites can take and what countermeasures can be taken to bypass them.
| Security Measure | Countermeasure |
|---|---|
| Browser fingerprinting | Headless browser |
| Storing data in JavaScript | Headless browser |
| IP rate limits | Proxy rotation |
| TLS fingerprinting | Forged TLS fingerprint |
| CAPTCHA | CAPTCHA-solving services |
By following the simple rules listed above, you can not only avoid blocking but also significantly improve the scraper's efficiency.
In addition, keep in mind that many sites provide an official API for obtaining their data. If such an option exists, it is better to use it than to collect the data from the site manually.