Web Scraping Without Getting Blocked

Last edit: May 22, 2024

When building a web scraper, it is worth planning for the possibility of being blocked: not every service welcomes having its data scraped.

To limit bot traffic, site developers use IP address recognition, HTTP request header inspection, CAPTCHAs, and other bot-detection methods. These can still be bypassed, but doing so requires following certain rules while scraping.

Even when a site imposes no explicit restrictions, show respect and avoid harming it: follow the rules outlined in robots.txt, don't scrape data during peak hours, limit requests coming from the same IP address, and set delays between them.

Settings to Avoid Blocks

First, configure the scraper correctly.

Set Request Intervals

The most common mistake when building a web scraper is using fixed intervals. No real person visits a site at exactly the same interval, 24 hours a day.

Therefore, define a range within which the pause between iterations varies randomly. As a rule, the minimum delay should be two seconds or more.

Also, don't flip through the pages too fast. Stay on each page for a while. Imitating user behavior like this reduces the risk of blocking.
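
A minimal sketch of such a randomized pause, using only the standard library (the bounds are parameters you would tune for the target site):

```python
import random
import time

def polite_sleep(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Pause for a random interval between min_s and max_s seconds,
    so requests do not arrive at a fixed, bot-like cadence."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Call between page fetches, e.g.: fetch(page); polite_sleep()
delay = polite_sleep(0.01, 0.02)  # tiny bounds used only for this demo
```

In a real scraper you would keep the two-second-plus bounds; the point is only that the delay is drawn from a range, not a constant.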


Set User Agent

The User-Agent header contains information about the user's browser and device. In other words, it is data the server receives at the time of the visit and uses to identify each visitor. If a user with the same User-Agent makes too many requests, the server may ban them.

Therefore, it is worth building into the web scraper the ability to periodically switch the User-Agent header to a random one from a list of real browser strings. This helps avoid blocking and lets the scraper keep collecting information.

To view your own User-Agent, open DevTools (F12) and go to the Network tab.
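
A sketch of such rotation; the User-Agent strings below are real-looking examples that would need to be kept current with actual browser releases:

```python
import random

# A small pool of desktop browser User-Agent strings (examples only;
# in practice, refresh this list as browser versions change).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def random_headers() -> dict:
    """Pick a different User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
# e.g. requests.get(url, headers=random_headers())
```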


Set Additional Request Headers

However, besides the User-Agent, other headers can give the scraper away. Web scrapers and crawlers often send headers that differ from those a real web browser sends. Therefore, it is worth taking the time to set all the headers so that requests do not look like automated bot traffic.

As a rule, a real browser also fills in the "Accept", "Accept-Encoding", "Accept-Language", and "Upgrade-Insecure-Requests" headers, so do not forget about them either. An example of these fields:

accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
upgrade-insecure-requests: 1
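
A sketch of attaching exactly these headers to a request with the standard library (the URL is a placeholder):

```python
import urllib.request

# The Accept/Accept-Encoding/Accept-Language/Upgrade-Insecure-Requests
# values shown above, combined into one request.
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8,"
              "application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}

req = urllib.request.Request("https://example.com/", headers=BROWSER_HEADERS)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```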

Set Referer

The Referer header shows the site from which the user came. If you don't know what to enter in this field, you can use "google.com". It can also be any other search engine (yahoo.com, bing.com, etc.) or any social media site. For example, it might look like this:

Referer: https://www.google.com/

Set Your Fingerprint Right

Whenever someone connects to a target website, their device sends a request that includes HTTP headers. These headers contain information such as the device's time zone, language, privacy settings, cookies, and more. The browser transmits them on every visit, and taken together they are fairly unique.

For example, a given combination of such parameters may be unique to roughly one user in 200,000. Therefore, it is worth keeping this information realistic and consistent. Alternatives include third-party scraping services or residential IPs. Various online services let you check your own browser fingerprint.

However, it is not only browser fingerprints that need to look right, but TLS fingerprints too, and many sites track TLS/HTTP fingerprints. For example, most scrapers use HTTP/1.1, while most browsers use HTTP/2 when it is available. Therefore, requests over HTTP/1.1 will look suspicious to many sites.


Other Ways to Avoid Blocks

So, once all the settings are in place, it's time to move on to the main traps and rules to follow.

Use Headless Browser

First of all, if possible, use a headless browser. It imitates real user behavior, reducing the risk of blocking, and since it has no visible window, everything runs in the background.

It also makes it possible to receive data that is loaded with JavaScript or on dynamic AJAX pages. The most common option is headless Chrome, which most scraping libraries (for example, Selenium) can drive.


Headless browsers render style elements such as fonts, layouts, and colors, so they are harder to recognize and distinguish from a real user.
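
A minimal sketch of driving headless Chrome with Selenium (assumes Selenium 4+ and a local Chrome install; the URL is a placeholder, and the imports sit inside the function so the module can be read without Selenium installed):

```python
HEADLESS_ARGS = [
    "--headless=new",               # run Chrome without a visible window
    "--window-size=1920,1080",      # a realistic desktop viewport
    "--disable-blink-features=AutomationControlled",
]

def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chrome and return the HTML after
    JavaScript has run, just as a real browser would see it."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# html = fetch_rendered_html("https://example.com/")
```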

Use Proxy Server

If requests keep coming from the same place at short intervals for a long time, the behavior looks less like a normal user and more like a bot. To keep the target website from suspecting anything, one can use a proxy server.

In simple terms, a proxy is an intermediary computer that makes a request to the server instead of the client and returns a result to the client. Thus, the destination server thinks that the request is made from a completely different place, and therefore, by a completely different user.


Proxies come both free and paid. However, it is better not to use free proxies for data scraping: they are extremely slow and unreliable. Ideally, use residential or mobile proxies. A single proxy is also not enough; for scraping, it is better to build a whole proxy pool.

It is also very important to watch which IP address the request comes from. If the location does not match what the site expects, it may simply block the request: for example, a local service is unlikely to interest foreign users. So it is better to use local proxies when scraping such sites, so as not to arouse suspicion.
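
A sketch of a rotating proxy pool; the proxy endpoints below are hypothetical placeholders you would replace with your own residential or mobile proxies:

```python
import itertools

# Hypothetical proxy endpoints -- replace with real ones.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies() -> dict:
    """Return a proxy mapping in the format the `requests` library
    expects, handing out a different proxy for each request."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# e.g. requests.get(url, proxies=next_proxies())
first = next_proxies()
second = next_proxies()
```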

Use CAPTCHA Solving Services

When there are too many requests, a site may ask the visitor to solve a CAPTCHA to make sure the request comes from a real user and not a bot. In this case, CAPTCHA-solving services, which automatically recognize the challenge for a small fee, can help.

Avoid Honeypot Traps

To catch bots, many websites use honeypot traps. Typically, a honeypot is a link that is invisible on the rendered page but present in the HTML source. When harvested automatically, such links can redirect a web scraper to decoy or blank pages.

Fortunately, they are fairly easy to spot: such links usually carry "masking" CSS properties, for example "display: none" or "visibility: hidden", or a link color identical to the site's background.
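
A sketch of filtering out links hidden with inline CSS, using only the standard-library HTML parser (it catches the style-based markers above, but not the background-color trick, which requires computing styles):

```python
from html.parser import HTMLParser

HIDDEN_MARKERS = ("display:none", "display: none",
                  "visibility:hidden", "visibility: hidden")

class VisibleLinkCollector(HTMLParser):
    """Collect hrefs while skipping links hidden via inline CSS --
    a common honeypot pattern."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").lower()
        if any(marker in style for marker in HIDDEN_MARKERS):
            return  # honeypot: invisible to real users, so skip it
        if "href" in attrs:
            self.links.append(attrs["href"])

html = '<a href="/real">ok</a><a href="/trap" style="display:none">x</a>'
parser = VisibleLinkCollector()
parser.feed(html)
```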


Avoid JavaScript

Scraping JavaScript-rendered content, like scraping images, is not in itself something that causes blocking. But note that not all libraries can collect such data, which means a web scraper capable of handling dynamic content will have more complex code and require more computing power.

Using Ready API in Web Scrapers

If the settings and rules listed above seem like too much, and the costs of proxies and CAPTCHA-solving services seem too high, there is an easier way: "redirect" the interaction with the site to a third-party service.

HasData offers a REST API for scraping web pages at any scale. The service takes care of IP blocks, IP rotation, CAPTCHAs, JavaScript rendering, finding and using residential or datacenter proxies, and setting HTTP headers and custom cookies. The user submits a query and the API returns the data.

Tips & Tricks for Scraping

The last things worth mentioning are when it is best to scrape websites and how reverse engineering helps in scraping. These matter not only for avoiding blocks but also for not harming the site.

Scrape During Off-peak Hours

Because crawlers move through pages faster than real users, they significantly increase the load on the server. If scraping runs while the server is already under heavy load, its services slow down and the site loads more slowly.

This will not only negatively affect the traffic of the site by real users but also increase the time required for data collection.

Therefore, it is worth collecting and extracting data at moments of minimal site load. It is generally recommended to run the scraper after midnight, local time for the site.


Scrape at Different Day Times

If the site sees heavy load every day from 8:00 to 8:20 am, that starts to raise suspicion. Therefore, add a random offset so the scraping start time changes from day to day.
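
A sketch of such a jittered schedule: the next run starts at a quiet base hour (1 am here, an assumed choice) plus a random offset, so no two days start at the same minute:

```python
import random
from datetime import datetime, timedelta

def next_run(base_hour: int = 1, max_jitter_minutes: int = 90) -> datetime:
    """Next run at base_hour local time plus a random offset, so the
    scraper never starts at exactly the same minute every day."""
    now = datetime.now()
    start = now.replace(hour=base_hour, minute=0, second=0, microsecond=0)
    if start <= now:
        start += timedelta(days=1)  # that hour already passed today
    return start + timedelta(minutes=random.uniform(0, max_jitter_minutes))

run_at = next_run()
# A scheduler would then sleep until run_at before starting the crawl.
```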


Reverse Engineering for Better Scraping

Reverse engineering is a common development technique. In short, it means studying a software application to understand how it functions.

For scraper development, this approach means analyzing a site's traffic first in order to compile future requests. The browser's developer tools, or simply DevTools (press F12), help analyze web pages.

Let's take a closer look at the Google SERP. Open DevTools, switch to the Network tab, search for something at google.com, and examine the resulting request. To view the response, click the request and go to the Preview tab:

(Screenshot: the Preview tab in the Network panel)

This data shows what exactly the request should return and in what form. The Headers tab shows what data must be sent to compile the request. The main thing is to execute requests correctly and interpret the responses correctly.
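
Once the request is understood, it can be rebuilt in code. A sketch using the standard library, with the two query parameters (`q`, `hl`) that are visible in the Network tab for a plain search; a real request carries many more, most of which can usually be dropped:

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_search_url(query: str, lang: str = "en") -> str:
    """Rebuild a Google search request observed in DevTools:
    q is the search text, hl the interface language."""
    params = {"q": query, "hl": lang}
    return "https://www.google.com/search?" + urlencode(params)

url = build_search_url("web scraping")
```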

Reverse Engineering of Mobile Applications

The situation is similar for mobile applications, except that here one must intercept the request the mobile app sends to its server. Unlike intercepting normal browser traffic, this requires a man-in-the-middle proxy, such as Charles Proxy.

Also, keep in mind that the requests sent by mobile applications tend to be more complex and confusing.

Conclusion and Takeaways

Finally, let's take a look at what security measures sites can take and what countermeasures can be taken to bypass them.

| Security Measure | Countermeasure |
| --- | --- |
| Browser fingerprinting | Headless browser |
| Storing data in JavaScript | Headless browser |
| IP rate limits | Proxy rotation |
| TLS fingerprinting | Forging the TLS fingerprint |
| CAPTCHA | CAPTCHA-solving services |

By following the simple rules listed above, you can not only avoid blocks but also significantly increase the efficiency of your scraper.

In addition, keep in mind that many sites provide an API for obtaining their data. When such an option exists, it is better to use the API than to collect the data from the site manually.

Valentina Skakun

I'm a technical writer who believes that data parsing can help you get and analyze data. In my articles, I explain what parsing is and how to use it.