Web scraping with Node.js comes down to three steps: fetch the HTML with fetch, parse it with Cheerio, and run Playwright when a page needs a real browser. The rest of this guide walks through each one with working code, the benchmarks behind every recommendation, and what to do when a page filters bots, rate-limits requests, or hides data in JavaScript.
You’ll need Node 22+, working knowledge of async/await, and Chrome DevTools.
Key takeaways:
- fetch and Cheerio handle 90% of scraping work in one npm install.
- p-limit at concurrency 5 cuts a 50-page scrape from 17.8s to 2.5s. Going to 10 or 20 parallel workers saves about another second, not another order of magnitude.
- Playwright beats Puppeteer in 2026. Multi-browser, auto-wait, and ~800ms faster cold-start in my measurement.
- The stealth plugin takes a base Playwright instance from 6/11 to 11/11 on bot.sannysoft.com.
- The whole stack is five npm packages on Node 22.
Why Node.js for web scraping
Scraping is mostly waiting. A fetch call spends maybe 50ms reading TCP and 200-500ms sitting idle while the target server does its work. The Node event loop turns that idle time into capacity. You fire a hundred requests, hold a hundred connections open, and react to each one when it returns, all on one thread.
100 scrape requests then take about as long as one request plus event-loop overhead, not 100x one request. A 50-page scrape of books.toscrape.com goes from 17.8s sequential to 2.5s at concurrency 5.
You won’t write .then() chains to make scraping concurrent. cheerio, playwright, p-limit, and native fetch all return promises out of the box. The Node ecosystem assumed async from day one. You don’t fight Node to get parallelism. You fight Node to slow down enough not to get banned.
The Node.js scraping stack
The stack has three layers. An HTTP client gets the page, a parser pulls data out of the HTML, and a headless browser runs JavaScript when the page needs one.
HTTP clients
fetch is built into Node 22+. It handles headers, body parsing, timeouts via AbortController, and streaming responses, with nothing to install. The libraries below each solve a problem fetch does not handle.
When each one is worth installing:
| Library | When to install |
|---|---|
| fetch (built-in) | Default on Node 22+. |
| axios | The codebase already uses it, or you want interceptors. |
| ky | You want retry-with-backoff without writing the loop. |
| got | Cookie jar across requests, or finer streaming control. |
| node-crawler | You want a queue and throttling bundled in. |
I use fetch. The others get added when a specific feature is worth the dependency.
HTML parsers
cheerio loads HTML into a jQuery-like object. The selectors and traversal API are the ones a frontend dev already knows, applied to a string. Parsing a typical product page takes a few milliseconds and a few MB of memory.
jsdom implements the full browser DOM in JavaScript. Inline scripts can run, document.querySelector works, you can read computed styles. The same page parsed through jsdom takes around 10x the time and memory cheerio uses. Use jsdom only when a page builds its DOM with embedded scripts you cannot reach through cheerio.
Skip regular expressions for HTML. Nested tags, optional whitespace, attribute order, encoded characters, and comments all break a regex by the second weird page. The thirty seconds of “but it works on my test page” turns into a week of debugging.
Headless browsers
playwright is the default for new projects in 2026. Multi-browser (Chromium, Firefox, WebKit) and built-in auto-wait for selectors. Microsoft maintains it and releases every few weeks.
puppeteer is older and Chrome-first. It can also drive Firefox, but not WebKit. The API is similar enough that translating between the two takes minutes. If you already have a Puppeteer codebase, there is no urgency to migrate. Start a new project on Playwright.
You need a browser when a page builds its content with JavaScript and the data isn’t in the raw HTML. A two-second check tells you which is which, and many JS-rendered pages also call a JSON endpoint you can hit directly for roughly 10x the speed of launching a browser.
Setting up a Node.js scraping project
The whole stack installs in four commands.
mkdir nodejs-scraper && cd nodejs-scraper
npm init -y && npm pkg set type=module
npm install cheerio playwright playwright-extra puppeteer-extra-plugin-stealth p-limit
npx playwright install chromium
Without npm pkg set type=module, .js files default to CommonJS. Older Node releases then throw on any import statement:
SyntaxError: Cannot use import statement outside a module
Recent Node 22 releases detect ESM syntax and run the file anyway, but setting the type explicitly removes the guesswork.
npx playwright install chromium downloads the browser into ~/.cache/ms-playwright/ on Linux (~/Library/Caches/ms-playwright on macOS), about 350 MB. Drop the browser name and Playwright pulls Chromium, Firefox, and WebKit at around 750 MB total. For most scraping work, Chromium alone is enough.
The resulting package.json looks like this.
{
  "name": "nodejs-scraper",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "cheerio": "^1.2.0",
    "p-limit": "^7.3.0",
    "playwright": "^1.60.0",
    "playwright-extra": "^4.3.6",
    "puppeteer-extra-plugin-stealth": "^2.11.2"
  }
}
A quick sanity check that everything is wired up correctly.
node -e "import('cheerio').then(c => c.load('<a>ok</a>'))" && echo "cheerio OK"
node -e "import('playwright').then(p => p.chromium.launch().then(b => b.close()))" && echo "playwright OK"
Both lines should end in OK. If playwright fails to launch, you missed npx playwright install chromium.
Static or dynamic scraping
This is the first question to answer on any new target. Skip it and you waste an hour writing a fetch + Cheerio scraper against a page that renders in the browser, watching $('.product') return an empty array and wondering what went wrong.
A page is “static” when the HTML returned by fetch already contains your target data. “Dynamic” means the data is missing from that HTML and shows up only after JavaScript runs in a real browser.
The view-source vs rendered DOM check
Open your browser’s DevTools (F12) and navigate to the target. View-source (Ctrl+U on Windows/Linux, Cmd+Option+U on Mac) shows the literal HTML the server returned. The Elements panel shows the live DOM after JavaScript has built and modified it.
If your target data is visible in view-source, fetch + Cheerio will see it too. If it only shows up in Elements, the page is dynamic and you need either a browser or the page’s underlying JSON endpoint.
For a concrete example, open https://quotes.toscrape.com/js/. The Elements panel shows ten quote blocks. Hit Ctrl+U and search for the first quote text. It isn’t there. The HTML has an empty <div class="container"> and an inline <script> that builds the quotes from a var data = [...] array. Running Cheerio on the raw HTML and querying .quote returns nothing.

The probe.js utility
The view-source check works fine for one target. Here is a script that does the same check from the command line in two seconds, so you can run it against any URL without leaving the terminal.
The script reads the fetched HTML and scores four categories of signals.
- Empty SPA mount points like <div id="root">, <div id="app">, <div id="__next">. Frameworks render into these from JavaScript. Scores 3 points.
- A <noscript> element warning that JavaScript is required. Scores 2 points.
- Short body text. Under 200 chars scores 3 points, 200-500 chars scores 1. A real article or product page has more plain text than that.
- Five or more <script src> tags. Bundler chunks usually mean a heavy client app. Scores 1 point.
Two or more points and the page is almost certainly dynamic.
// probe.js, usage: node probe.js <url>
import * as cheerio from 'cheerio';

const url = process.argv[2];
if (!url) {
  console.error('Usage: node probe.js <url>');
  process.exit(1);
}

const UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36';
const res = await fetch(url, { headers: { 'User-Agent': UA } });
const html = await res.text();
const $ = cheerio.load(html);
const signals = {};

// Signal 1: a framework mount point with almost no server-rendered text.
const mountSelectors = ['#root', '#app', '#__next', '#__nuxt', '[data-reactroot]', '#svelte', '#ember-app'];
for (const sel of mountSelectors) {
  if ($(sel).length && $(sel).text().trim().length < 50) {
    signals.emptyMount = sel;
    break;
  }
}

// Signal 2: a <noscript> warning that JavaScript is required.
if (/enable javascript|requires javascript/i.test($('noscript').text())) {
  signals.noscriptWarning = true;
}

// Signal 3: many external scripts, typical of bundled client apps.
signals.scriptCount = $('script[src]').length;
signals.manyScripts = signals.scriptCount >= 5;

// Signal 4: little visible text once scripts and styles are stripped.
$('script, style, noscript').remove();
const bodyText = $('body').text().replace(/\s+/g, ' ').trim();
signals.veryShortBodyText = bodyText.length < 200;
signals.shortBodyText = bodyText.length >= 200 && bodyText.length < 500;

const score =
  (signals.emptyMount ? 3 : 0) +
  (signals.noscriptWarning ? 2 : 0) +
  (signals.veryShortBodyText ? 3 : 0) +
  (signals.shortBodyText ? 1 : 0) +
  (signals.manyScripts ? 1 : 0);

console.log(score >= 2 ? 'YOU NEED A BROWSER' : 'CHEERIO WORKS');
console.log({ url, score, ...signals });
Run it against a known static page and a known dynamic one.
$ node probe.js https://books.toscrape.com
CHEERIO WORKS
{
  url: 'https://books.toscrape.com',
  score: 1,
  scriptCount: 5,
  manyScripts: true,
  veryShortBodyText: false,
  shortBodyText: false
}
$ node probe.js https://quotes.toscrape.com/js/
YOU NEED A BROWSER
{
  url: 'https://quotes.toscrape.com/js/',
  score: 3,
  scriptCount: 1,
  manyScripts: false,
  veryShortBodyText: true,
  shortBodyText: false
}
Server-rendered React stays in CHEERIO WORKS (react.dev scores 1/10 with this script). Heavy SPAs land in YOU NEED A BROWSER (airbnb.com scores 2/10, with 351 chars of body text and 40 script tags). The few targets that fall in the middle get a manual DevTools check.
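The scoring step is easier to unit-test once it is separated from fetching and parsing. The sketch below is one way to do that. The weights and the threshold of 2 are copied from probe.js; the function name scoreSignals and the needsBrowser field are illustrative, not part of the original script.

```javascript
// Weights copied from probe.js; names are illustrative.
const WEIGHTS = {
  emptyMount: 3,
  noscriptWarning: 2,
  veryShortBodyText: 3,
  shortBodyText: 1,
  manyScripts: 1,
};

// Sum the weight of every truthy signal and apply the "2 or more" rule.
function scoreSignals(signals) {
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    if (signals[name]) score += weight;
  }
  return { score, needsBrowser: score >= 2 };
}

// The two runs above, reduced to their truthy signals.
console.log(scoreSignals({ manyScripts: true }));       // books.toscrape.com
console.log(scoreSignals({ veryShortBodyText: true })); // quotes.toscrape.com/js/
```

With the signals from the two runs above, this gives a score of 1 for books.toscrape.com and 3 for quotes.toscrape.com/js/.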
Making HTTP requests for static pages
Native fetch covers the static path end to end. The parts worth knowing are headers, timeouts, status checks, and what to do when a request fails.
A minimal fetch request
const res = await fetch('https://books.toscrape.com');
const html = await res.text();
That works for an open static page. Most sites need at least browser-realistic headers, a timeout, and 4xx/5xx handling on top of that.
Headers and User-Agent
A default Node fetch sends a User-Agent of node (literally that string). Many sites block or redirect that immediately. The fix is sending a browser-realistic UA from the same OS you’d browse from.
const headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br, zstd',
};
const res = await fetch(url, { headers });
Grab the UA from your own browser. Chrome’s DevTools Network tab shows it on every outgoing request. Copy the User-Agent value from Request Headers. Chrome on macOS always reports 10_15_7 as the OS version, whatever the real release. The major version changes with each Chrome release (roughly every four weeks), so hardcoding Chrome/120 in 2026 is an anti-bot tell.
Request timeouts
A fetch call has no built-in deadline. A slow target hangs your scraper indefinitely. The fix is AbortController.
const controller = new AbortController();
// Abort the request if it has not finished within 10 seconds.
const timeout = setTimeout(() => controller.abort(), 10000);
try {
  const res = await fetch(url, { headers, signal: controller.signal });
  const html = await res.text();
  // do stuff
} finally {
  clearTimeout(timeout);
}
10 seconds is generous for most static pages. Drop it lower if your target is reliably fast and you want to fail fast on dead URLs.
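A shorter variant of the same idea uses AbortSignal.timeout(ms), which is built into Node 22 and returns a signal that aborts on its own, so there is no timer to clear. The sketch below is a minimal version under that assumption. The helper name fetchWithTimeout and its option names are hypothetical.

```javascript
// Hypothetical helper: fetch a URL as text with a hard deadline.
async function fetchWithTimeout(url, { headers = {}, timeoutMs = 10_000 } = {}) {
  try {
    const res = await fetch(url, {
      headers,
      // The signal aborts by itself after timeoutMs; nothing to clean up.
      signal: AbortSignal.timeout(timeoutMs),
    });
    if (!res.ok) throw new Error(`${res.status} ${res.statusText} for ${url}`);
    return await res.text();
  } catch (err) {
    // A fired timeout surfaces as an error named "TimeoutError".
    if (err.name === 'TimeoutError') {
      throw new Error(`timed out after ${timeoutMs} ms: ${url}`, { cause: err });
    }
    throw err;
  }
}
```

A call looks like `await fetchWithTimeout(url, { headers, timeoutMs: 5000 })`. The deadline covers both the response headers and the body read, since the same signal stays attached to the response.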
Status checks
fetch resolves on any HTTP response. A 404 or 500 looks the same as a 200 to await fetch(). Check res.ok (or res.status directly) before reading the body.
const res = await fetch(url, { headers });
if (!res.ok) {
  throw new Error(`${res.status} ${res.statusText} for ${url}`);
}
const html = await res.text();
This catches the 4xx and 5xx responses before you try to parse a Cloudflare block page as if it were product data.
Retries on transient failures
A 429 (rate-limited) or 503 (overloaded) usually means try again later. A 404 means the URL is wrong and retrying won’t help.
async function fetchWithRetry(url, headers, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    const res = await fetch(url, { headers });
    if (res.ok) return await res.text();
    // Only rate limits and overload/gateway errors are worth retrying.
    if (![429, 503, 504].includes(res.status)) {
      throw new Error(`${res.status} ${res.statusText} for ${url}`);
    }
    // Skip the sleep after the final attempt; there is nothing left to wait for.
    if (attempt < retries) {
      const wait = Math.min(1000 * 2 ** (attempt - 1), 8000);
      await new Promise(r => setTimeout(r, wait));
    }
  }
  throw new Error(`Out of retries for ${url}`);
}
Exponential backoff (1s, 2s, 4s, capped at 8s) is the conventional polite retry strategy. Cap retries low (3-4) so dead URLs fail fast. Raise the cap only if your target really benefits from waiting through a backend hiccup.
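Servers that send a 429 or 503 often include a Retry-After header, given either in seconds or as an HTTP date. The sketch below shows one way to prefer that value and fall back to the exponential schedule. The helper name retryDelay and the one-minute ceiling on server-requested waits are assumptions, not something the original scraper does.

```javascript
const BACKOFF_CAP_MS = 8_000;      // same cap as the backoff above
const MAX_SERVER_WAIT_MS = 60_000; // assumption: never sleep longer than a minute

// Hypothetical helper: how long to wait before the next attempt.
function retryDelay(attempt, retryAfter = null, now = Date.now()) {
  if (retryAfter) {
    // Form 1: delay in seconds, e.g. "Retry-After: 5".
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds)) {
      return Math.min(Math.max(seconds, 0) * 1000, MAX_SERVER_WAIT_MS);
    }
    // Form 2: an HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
    const dateMs = Date.parse(retryAfter);
    if (!Number.isNaN(dateMs)) {
      return Math.min(Math.max(dateMs - now, 0), MAX_SERVER_WAIT_MS);
    }
  }
  // No usable header: exponential backoff, 1s, 2s, 4s, capped at 8s.
  return Math.min(1000 * 2 ** (attempt - 1), BACKOFF_CAP_MS);
}

console.log([1, 2, 3, 4, 5].map(n => retryDelay(n)));
```

Inside fetchWithRetry, the wait would become `retryDelay(attempt, res.headers.get('retry-after'))`.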
When axios still makes sense
axios is fine if your codebase already uses it. For new scraping projects, fetch covers the same ground without the 50 KB of dependency.
Parsing HTML in Node.js
Two parsers handle the work. cheerio is the default and covers nearly everything. jsdom shows up only when you need to run inline scripts or read computed styles. Regex is not on the list, and the snippet below shows why.
const html1 = '<h3>Notebook (red)</h3>';
html1.match(/<h3>(.*?)<\/h3>/)[1];
// "Notebook (red)", works
const html2 = '<h3 data-testid="title">Notebook<span class="badge">new</span></h3>';
html2.match(/<h3>(.*?)<\/h3>/);
// null, the attribute on <h3> kills the match
html2.match(/<h3[^>]*>(.*?)<\/h3>/s)?.[1];
// "Notebook<span class=\"badge\">new</span>", nested tag ends up in the result
The regex either misses the attribute on <h3>, eats nested tags into the result, or both. Cheerio handles both shapes.
Cheerio for HTML parsing
cheerio.load(html) returns a $ function with the same selectors, traversal, and helper methods as jQuery. Scraping the books.toscrape.com catalog takes a few lines.
import * as cheerio from 'cheerio';
const res = await fetch('https://books.toscrape.com');
const $ = cheerio.load(await res.text());
// One object per product card on the page.
const books = $('article.product_pod').map((_, el) => ({
  title: $(el).find('h3 a').attr('title'),
  price: $(el).find('.price_color').text().trim(),
  rating: ($(el).find('.star-rating').attr('class') || '').replace('star-rating ', ''),
  link: new URL($(el).find('h3 a').attr('href'), 'https://books.toscrape.com').href,
})).get();
console.log(books.slice(0, 2));
Output:
[
  {
    title: 'A Light in the Attic',
    price: '£51.77',
    rating: 'Three',
    link: 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
  },
  {
    title: 'Tipping the Velvet',
    price: '£53.74',
    rating: 'One',
    link: 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'
  }
]
The pattern across most scrapers stays this short. Find your repeating element, .map() over it, pull each field with a selector and a .text() or .attr(). .each() does the same thing without returning a value when you only need side effects.
A few details that come up.
- $(el).text() returns the joined text of an element and all descendants. Useful for grabbing whatever text is inside a container. Wrap in .trim() or .replace(/\s+/g, ' ').trim() to clean up whitespace.
- Missing elements return jQuery-style empty selections. $(el).find('.maybe').text() returns '', not undefined. Code that checks for “is it missing” needs .length instead.
- Attribute access is .attr('name'). Returns undefined if missing.
- Relative URLs need new URL(href, baseUrl).href to become absolute links.
When Cheerio isn’t enough
Cheerio doesn’t execute JavaScript. Pages that build part of their content from inline scripts that need to run, like a JSON blob plus a decompression script, leave Cheerio seeing the tags but not the resulting DOM.
jsdom runs those scripts and gives you the rendered tree. The cost is about 10x parsing time and memory compared to Cheerio. Pull it in only when you see data tied to a script that needs to execute.
A complete static scraper example
Everything from the previous two sections connected into one runnable script that walks the full books.toscrape.com catalog, normalizes prices and ratings, and writes JSON and CSV.
The script has four pieces. fetchPage handles the request and status check. extractBooks runs the Cheerio mapping and normalizes the price string into a Number, the rating word into 1-5, and the relative link into an absolute URL. nextPageUrl reads the pagination link. A top-level loop walks pages until there is no next link.
// scrape-books-static.js
import * as cheerio from 'cheerio';
import { writeFileSync } from 'node:fs';
import { performance } from 'node:perf_hooks';

const BASE = 'https://books.toscrape.com/';
const RATINGS = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 };
const headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
};

// Fetch one page, fail on 4xx/5xx, and return a loaded Cheerio instance.
async function fetchPage(url) {
  const res = await fetch(url, { headers });
  if (!res.ok) throw new Error(`${res.status} ${res.statusText} for ${url}`);
  return cheerio.load(await res.text());
}

// Map each product card to a normalized record.
function extractBooks($) {
  return $('article.product_pod').map((_, el) => {
    const ratingWord = ($(el).find('.star-rating').attr('class') || '').replace('star-rating ', '');
    const priceText = $(el).find('.price_color').text().trim();
    return {
      title: $(el).find('h3 a').attr('title'),
      price: Number(priceText.replace(/[^\d.]/g, '')),
      currency: priceText.replace(/[\d.\s]/g, ''),
      rating: RATINGS[ratingWord] ?? null,
      link: new URL($(el).find('h3 a').attr('href'), BASE).href,
    };
  }).get();
}

// Resolve the "next" pagination link against the current page, or return null.
function nextPageUrl($, currentUrl) {
  const href = $('li.next a').attr('href');
  return href ? new URL(href, currentUrl).href : null;
}

const start = performance.now();
const all = [];
let url = BASE;
let pages = 0;
// Walk the catalog until a page has no "next" link.
while (url) {
  const $ = await fetchPage(url);
  all.push(...extractBooks($));
  url = nextPageUrl($, url);
  pages++;
}
const elapsed = ((performance.now() - start) / 1000).toFixed(1);
console.log(`scraped ${all.length} books across ${pages} pages in ${elapsed}s`);

writeFileSync('books.json', JSON.stringify(all, null, 2));
// Quote every CSV field and double any embedded quotes.
const escape = v => `"${String(v ?? '').replace(/"/g, '""')}"`;
const header = ['title', 'price', 'currency', 'rating', 'link'];
const csv = [
  header.join(','),
  ...all.map(b => header.map(k => escape(b[k])).join(',')),
].join('\n');
writeFileSync('books.csv', csv);
console.log('wrote books.json and books.csv');
Output:
$ node scrape-books-static.js
scraped 1000 books across 50 pages in 16.3s
wrote books.json and books.csv
16 seconds sequential is the floor for this approach. Concurrency cuts that to 2.5s without changing the scraping code, just the way pages are queued.
Storage when JSON files run out
JSON and CSV cover the common cases. SQLite is the next step when you run the scrape regularly and want to query the data without parsing the file every time (better-sqlite3 is the lightweight Node binding). MongoDB fits when records have ragged shapes (some products have a discount field, some don’t). PostgreSQL fits when you want strict schemas and joins. All three have mature Node drivers (better-sqlite3, the official mongodb package, and pg), so switching from a JSON file to a database changes one function.
Hit the JSON endpoint before reaching for a browser
When probe.js says YOU NEED A BROWSER, look for the JSON before you launch one. JS-rendered pages usually get their data from one of two places. An XHR call that returns JSON (the page renders that JSON in the browser), or an inline <script> tag that ships the data already (the page reads it on load). Both paths skip the browser.
I measured both paths on quotes.toscrape.com/js, 10 runs each.
| Path | Median time |
|---|---|
| Playwright (render + extract) | 3038 ms |
| fetch + parse inline JSON | 277 ms |
11x faster, with no Playwright launch in between.
Finding the endpoint in DevTools
Open the target in Chrome, hit DevTools, switch to Network, and filter to Fetch/XHR. Reload the page. Most pages make one or two requests that return JSON, often with names like /api/products, /graphql, or /_next/data/.... Click one and check the Response tab. If your target data is there, that is the endpoint.

Some pages skip the XHR and embed the data in a <script> tag instead. These don’t show up in Network because the data ships inside the HTML response. Search view-source for one of these shapes.
- <script id="__NEXT_DATA__" type="application/json">...</script> (Next.js)
- <script>window.__NUXT__ = ...</script> (Nuxt)
- <script>var data = [...]</script> (the quotes.toscrape.com/js case)
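The same search can be scripted. The sketch below checks raw HTML for the three shapes and parses the payload when it is plain JSON. The helper name findInlineData is hypothetical, and it uses string matching only so that it runs without dependencies; a Cheerio selector is the sturdier choice for real pages.

```javascript
// Hypothetical helper: detect embedded page data in raw HTML.
function findInlineData(html) {
  // Next.js: a JSON script tag with a fixed id.
  const next = html.match(/<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  if (next) return { kind: 'next', data: JSON.parse(next[1]) };

  // quotes.toscrape.com/js style: an array literal assigned to `var data`.
  const arr = html.match(/var\s+data\s*=\s*(\[[\s\S]*?\]);/);
  if (arr) return { kind: 'var-data', data: JSON.parse(arr[1]) };

  // Nuxt payloads are often JavaScript rather than strict JSON, so this
  // sketch only reports that one is present and leaves parsing to the caller.
  if (/window\.__NUXT__\s*=/.test(html)) return { kind: 'nuxt', data: null };

  return null;
}
```

A null result means none of the three shapes was found, which points back to the Network tab or to a browser.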

Hitting an XHR endpoint
Right-click the request in Network and pick Copy, then Copy as fetch (Node.js). Chrome puts a working fetch() call on your clipboard with the headers and cookies the page used. Drop that into your scraper. The pattern looks like this.
const res = await fetch('https://hacker-news.firebaseio.com/v0/topstories.json', {
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36',
    'Accept': 'application/json',
    'Accept-Language': 'en-US,en;q=0.9',
  },
});
// The endpoint returns a JSON array of story IDs.
const topStoryIds = await res.json();
console.log(topStoryIds.slice(0, 5));
Output:
[ 48500012, 48498385, 48497609, 48490024, 48500404 ]
For pages gated behind a login, the copied request includes a Cookie header. Paste it in verbatim while you test, then check whether the app exposes an API token instead. Cookies expire. API tokens are usually more durable and easier to put in a config file.
Pulling inline JSON out of the HTML
For the inline case, fetch the page and pull the data straight out of the script tag. The quotes.toscrape.com/js page ships its quotes in a var data = [...] literal.
const res = await fetch('https://quotes.toscrape.com/js/');
const html = await res.text();
// Lazy match up to the first "];", which closes the array literal.
const match = html.match(/var\s+data\s*=\s*(\[[\s\S]*?\]);/);
if (!match) throw new Error('inline data array not found');
const quotes = JSON.parse(match[1]);
console.log(quotes.length, quotes[0]);
Output:
10 {
  tags: [ 'change', 'deep-thoughts', 'thinking', 'world' ],
  author: {
    name: 'Albert Einstein',
    goodreads_link: '/author/show/9810.Albert_Einstein',
    slug: 'Albert-Einstein'
  },
  text: '"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."'
}
For Next.js apps with __NEXT_DATA__, swap the regex for a Cheerio selector.
import * as cheerio from 'cheerio';
const $ = cheerio.load(html);
const nextData = JSON.parse($('#__NEXT_DATA__').text());
// nextData.props.pageProps holds the page's full data tree
Scraping with headless browsers
When fetch returns empty and the page has no JSON endpoint, you need a headless browser.
Playwright
Chromium installs once, then the same API works across every script.
npx playwright install chromium
A minimal scrape of quotes.toscrape.com/js.
import { chromium } from 'playwright';
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://quotes.toscrape.com/js/', { waitUntil: 'networkidle' });
await page.waitForSelector('.quote');
// $$eval runs the callback inside the page, so each el is a real DOM element.
const quotes = await page.$$eval('.quote', els =>
  els.map(el => ({
    text: el.querySelector('.text')?.textContent,
    author: el.querySelector('.author')?.textContent,
  }))
);
await browser.close();
console.log(quotes.slice(0, 2));
Output:
[
  {
    text: '"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."',
    author: 'Albert Einstein'
  },
  {
    text: '"It is our choices, Harry, that show what we truly are, far more than our abilities."',
    author: 'J.K. Rowling'
  }
]
Three things in this scraper that matter.
- waitUntil: 'networkidle' waits until the page has made no network requests for at least 500 ms. For JS-rendered pages this catches data that arrives after 'load', which fires once the document and its subresources have loaded, not when later XHR data is in. Playwright’s own docs discourage relying on networkidle alone, which is why the selector wait follows it.
- waitForSelector('.quote') blocks until the selector appears in the DOM. Without it, the script races the framework’s render and reads an empty page.
- page.$$eval runs the callback inside the page context. Each el is a real DOM element, so querySelector and textContent work as in the browser.
When a selector might not exist on a given page, wrap the wait in a short-timeout try/catch.
try {
  await page.waitForSelector('.quote', { timeout: 5000 });
  // extract
} catch {
  console.warn(`no .quote on ${url}, skipping`);
}
5 seconds is a reasonable cap. Playwright’s default is 30 seconds, which is too long for a scraper that hits hundreds of pages and finds the occasional missing one.
Puppeteer
Puppeteer is the older Chromium-focused library. The API is similar enough that translating the same scraper takes minutes.
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://quotes.toscrape.com/js/', { waitUntil: 'networkidle0' });
await page.waitForSelector('.quote');
const quotes = await page.$$eval('.quote', els =>
  els.map(el => ({
    text: el.querySelector('.text')?.textContent,
    author: el.querySelector('.author')?.textContent,
  }))
);
await browser.close();
console.log(quotes.slice(0, 2));
Two differences from the Playwright version. The idle condition is spelled networkidle0 instead of networkidle (no in-flight requests for at least 500 ms; Puppeteer also offers networkidle2, which tolerates up to two). And puppeteer ships its own Chrome binary by default, downloaded on install (~170 MB).
For a new scraper, Playwright is the better choice. For an existing Puppeteer codebase, there is no urgency to migrate.
Playwright vs Puppeteer in 2026
| | Playwright | Puppeteer |
|---|---|---|
| Browsers | Chromium, Firefox, WebKit | Chrome/Chromium, Firefox |
| Auto-wait on common actions | Yes | No (explicit waitForSelector) |
| Cold start, median (my machine) | 1423 ms | 2210 ms |
| Maintainer | Microsoft | Chrome DevTools team |
| Stealth plugin | playwright-extra | puppeteer-extra |
Cold start measured by spawning a fresh Node process per iteration, importing the library, launching the browser, opening a page, navigating to about:blank, and closing. Five runs each, median reported.
The roughly 800ms gap (2210 ms vs 1423 ms) is paid on every browser launch. It adds up when each job or page starts a fresh browser, and it tilts the choice for a new scraper toward Playwright.
The challenges of web scraping at scale
The scraper that works on one page hits walls somewhere between request one and request ten thousand. Four shapes are common.
IP blocking shows up first. After enough requests from the same address, the target either blocks the IP outright or starts returning 403 and 429 responses. Bigger targets behind Cloudflare or Akamai block at the edge, before your request reaches the application.
Bot detection works even when the IP is clean. The page inspects your request and decides you are not a real browser. Common signals include missing sec-ch-ua headers, browser JavaScript globals that look wrong (like navigator.webdriver or chrome.runtime), TLS fingerprint, and timing patterns.
CAPTCHAs are a special case of bot detection. The page shows you a Cloudflare Turnstile, hCaptcha, or similar challenge. You cannot get past it without a solver service or an actual user.
Rate limiting comes from the application layer rather than the bot wall. The target serves your requests but slows them down or returns 429 after some threshold. The threshold itself can be per-IP, per-account, or per-region.
Getting past anti-bot blockers
Each failure mode has a known fix. The fixes range in cost from one line of headers to a dedicated proxy budget.
Realistic headers
A fetch that only sends User-Agent is a giveaway against any modern detector. The 12 headers Chrome sends on macOS look like this.
const headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br, zstd',
  'sec-ch-ua': '"Chromium";v="148", "Google Chrome";v="148", "Not/A)Brand";v="99"',
  'sec-ch-ua-mobile': '?0',
  'sec-ch-ua-platform': '"macOS"',
  'sec-fetch-dest': 'document',
  'sec-fetch-mode': 'navigate',
  'sec-fetch-site': 'none',
  'sec-fetch-user': '?1',
  'upgrade-insecure-requests': '1',
};
The sec-ch-ua-platform value has to match the UA. A macOS UA with "Windows" here is a detection signal on its own. Chrome ships a new major version roughly every four weeks, so update the version numbers regularly and keep them in sync between User-Agent and sec-ch-ua.
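One way to keep those values in sync is to derive them from a single version number and a single platform name. The sketch below does that for the four headers that must agree. The helper name buildChromeHeaders and the PLATFORMS table are hypothetical, and the sec-ch-ua brand list copies the shape shown above; real Chrome varies the placeholder brand between releases, so check it against your own browser.

```javascript
// UA platform tokens for desktop Chrome. These are assumptions to verify
// against a real browser before use.
const PLATFORMS = {
  macOS: 'Macintosh; Intel Mac OS X 10_15_7',
  Windows: 'Windows NT 10.0; Win64; x64',
  Linux: 'X11; Linux x86_64',
};

// Hypothetical helper: build the headers that must agree with each other.
function buildChromeHeaders(major, platform = 'macOS') {
  const uaPlatform = PLATFORMS[platform];
  if (!uaPlatform) throw new Error(`unknown platform: ${platform}`);
  return {
    'User-Agent': `Mozilla/5.0 (${uaPlatform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/${major}.0.0.0 Safari/537.36`,
    'sec-ch-ua': `"Chromium";v="${major}", "Google Chrome";v="${major}", "Not/A)Brand";v="99"`,
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': `"${platform}"`,
  };
}

console.log(buildChromeHeaders(148, 'Windows'));
```

Spread the result into the full header object, for example `{ ...headers, ...buildChromeHeaders(148) }`, so a version bump is a one-number change.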
Stealth plugin for the browser path
Headers cover the static path. When you are using Playwright, the page’s JavaScript inspects the browser itself. navigator.webdriver, chrome.runtime, navigator.plugins.length, and the WebGL vendor string all leak the automation signal on default Chromium.
playwright-extra accepts plugins. puppeteer-extra-plugin-stealth (despite the name, works with Playwright too) patches the common leaks.
import { chromium } from 'playwright-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
// Register the stealth evasions before launching the browser.
chromium.use(StealthPlugin());
const browser = await chromium.launch();
const page = await browser.newPage();
// ... rest is identical to plain Playwright
Measured against bot.sannysoft.com, which runs 11 standard automation checks, a base Playwright instance passes 6 of 11 and the stealth-patched one passes 11 of 11.

Sannysoft is the baseline. Modern anti-bot stacks (Cloudflare, PerimeterX, DataDome) catch more than this, but the stealth plugin handles the signals every check looks at first.
Cloudflare Turnstile
Turnstile and the older interstitial Cloudflare protections (the “checking your browser” page, the error 1020 Access Denied block) are the most common walls a scraper hits. The minimum that works reliably has three parts.
- A residential or mobile IP. Datacenter IPs are flagged on contact.
- The stealth plugin from above.
- Patient timing. Pause for the challenge to resolve. Don’t pound the page on retry.
Some pages require executing the Turnstile challenge inside a Chrome instance. Headless Chromium with stealth handles many of them, not all. The cases the stealth plugin doesn’t cover need either a CAPTCHA solver service or a managed scraping API.
Proxy rotation and retries
For IP blocking, run requests through a pool of proxies and swap to the next one on failure. Residential proxies (actual consumer IPs) cost more but get blocked less. Datacenter proxies are cheaper and fine for relaxed targets.
fetch in Node 22 does not accept a proxy option directly. The cleanest way to add one is through undici, which is the HTTP layer under native fetch.
npm install undici
Minimal example:
import { fetch, ProxyAgent } from 'undici';
// PROXIES is a comma-separated list; each entry: http://user:pass@host:port
const proxies = (process.env.PROXIES ?? '').split(',').filter(Boolean);
let cursor = 0;
const nextProxy = () => proxies[cursor++ % proxies.length];
const RETRYABLE = [403, 429, 503];

async function fetchWithProxy(url, headers, maxAttempts = 6) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const dispatcher = new ProxyAgent(nextProxy());
    let res;
    try {
      res = await fetch(url, { headers, dispatcher });
    } catch (err) {
      // Network or proxy failure: move to the next proxy unless out of attempts.
      if (attempt === maxAttempts) throw err;
    }
    if (res) {
      if (res.ok) return await res.text();
      // Anything outside the retryable set (a 404, say) fails immediately.
      // Throwing here, outside the try, keeps the catch from swallowing it.
      if (!RETRYABLE.includes(res.status)) {
        throw new Error(`${res.status} ${res.statusText} for ${url}`);
      }
    }
    if (attempt < maxAttempts) {
      await new Promise(r => setTimeout(r, Math.min(1000 * 2 ** (attempt - 1), 16000)));
    }
  }
  throw new Error(`out of attempts for ${url}`);
}
Three details that matter.
- The backoff caps at 16 seconds. Beyond that you are not retrying, you are waiting forever.
- Rotation runs round-robin (cursor++ % proxies.length). For uneven proxy quality, weight the rotation by recent success rate instead.
- The retry loop separates transient errors (403, 429, 503) from real failures. A 404 means the URL is wrong, and rotating the proxy will not fix it.
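Weighting the rotation by success rate can be sketched as a small pool that tracks outcomes per proxy and picks in proportion to them. Everything here is illustrative: the ProxyPool class, its method names, and the choice to count all past outcomes rather than only recent ones. A production version would probably decay old results.

```javascript
// Hypothetical pool: pick proxies in proportion to their observed success rate.
class ProxyPool {
  constructor(urls, random = Math.random) {
    // Start every proxy at one success and one failure, so a new entry
    // gets a neutral 50% score instead of 0% or 100%.
    this.stats = urls.map(url => ({ url, ok: 1, fail: 1 }));
    this.random = random; // injectable for deterministic tests
  }

  successRate(s) {
    return s.ok / (s.ok + s.fail);
  }

  // Roulette-wheel selection: each proxy owns a slice sized by its rate.
  pick() {
    const total = this.stats.reduce((sum, s) => sum + this.successRate(s), 0);
    let r = this.random() * total;
    for (const s of this.stats) {
      r -= this.successRate(s);
      if (r <= 0) return s.url;
    }
    return this.stats.at(-1).url; // guard against floating-point drift
  }

  // Record the outcome of a request made through `url`.
  report(url, succeeded) {
    const s = this.stats.find(x => x.url === url);
    if (!s) return;
    if (succeeded) s.ok++;
    else s.fail++;
  }
}
```

In fetchWithProxy, `nextProxy()` would become `pool.pick()`, with a `pool.report(proxyUrl, res.ok)` call after each response.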
All of this (headers, stealth, proxies, retries) is the DIY path. If you’d rather skip the infrastructure, HasData’s Web Scraping API handles rotating proxies, CAPTCHA solving, and JS rendering. Pass it a URL and get HTML or structured data back.
Scaling with concurrency
Concurrency in a scraper is a promise pool. Hold a fixed number of in-flight requests and start a new one as soon as one finishes. p-limit does exactly that in one line.
import pLimit from 'p-limit';
const limit = pLimit(5);
const results = await Promise.all(
  urls.map(url => limit(() => fetchPage(url)))
);
The shape is Promise.all over the URL list, with each call wrapped in limit(). The pool keeps 5 fetches in flight at any moment, queuing the rest, and the whole thing resolves when the last page finishes.
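To make the promise-pool idea concrete, here is a minimal hand-rolled limiter in the spirit of p-limit. It is a sketch for understanding the mechanism, not p-limit's actual implementation, and createLimit is a hypothetical name. Use p-limit itself in real code.

```javascript
// Minimal promise pool: at most `concurrency` tasks run at once, the rest wait in a queue.
function createLimit(concurrency) {
  let active = 0;
  const queue = [];

  // Start the next queued task if a slot is free.
  const next = () => {
    if (active >= concurrency || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task)            // run the task (sync throws become rejections)
      .then(resolve, reject) // hand the outcome back to the caller
      .finally(() => {
        active--;            // free the slot...
        next();              // ...and immediately start another task
      });
  };

  // limit(task) returns a promise that settles with the task's result.
  return task =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}
```

It drops in where pLimit(5) is used above: const limit = createLimit(5), then limit(() => fetchPage(url)) for each URL. A failed task rejects only its own promise and still frees its slot, so one bad page does not stall the queue.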
The concurrency benchmark
I ran the books.toscrape.com capstone at concurrency 1, 3, 5, 10, and 20. Single run per level, 50 pages of 20 books each.

The curve flattens hard after concurrency 5. Doubling from 5 to 10 buys 1.1 seconds. Going from 10 to 20 buys 0.3 seconds. Past that, you are not getting faster, you are getting closer to the slowest single request in the batch.
Sequential cost is the sum of all request times. With N requests in flight, the cost is roughly that sum divided by N, but it can never drop below the slowest single request. Once N is large enough that the slowest pages set the floor, more parallelism adds almost no throughput. For books.toscrape.com that point is somewhere around 10.
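The cost model above can be checked with a few lines of arithmetic. The sketch below simulates a pool by handing each request to whichever slot frees up first. estimateWallTime is a hypothetical helper, and the latencies are invented for illustration, not taken from the benchmark.

```javascript
// Estimate wall-clock time (ms) for a promise pool, given per-request latencies.
// Assumes concurrency >= 1 and ignores parsing time and event-loop overhead.
function estimateWallTime(latencies, concurrency) {
  // One running total per slot; never more slots than requests.
  const slots = new Array(Math.min(concurrency, latencies.length)).fill(0);
  for (const ms of latencies) {
    // The next request goes to the slot that becomes free earliest.
    const i = slots.indexOf(Math.min(...slots));
    slots[i] += ms;
  }
  // The scrape ends when the busiest slot finishes.
  return Math.max(0, ...slots);
}

// Made-up latencies: three fast pages and one slow one.
const latencies = [300, 300, 300, 900];
console.log(estimateWallTime(latencies, 1)); // 1800 (the sum)
console.log(estimateWallTime(latencies, 2)); // 1200
console.log(estimateWallTime(latencies, 4)); // 900 (the slowest request)
```

Past concurrency 4 the estimate stays at 900ms, because the slow page alone takes that long. That is the flattening the benchmark shows.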
The practical recommendation is 5 to 10 for most targets. Higher than that, the target will start rate-limiting you or noticing the burst of requests from one IP and cutting it off.
CPU-bound parsing with worker_threads
When the work is I/O bound (HTTP requests waiting on the network), p-limit is enough. CPU-bound work, like parsing thousands of HTML pages, running heavy regex, or decompressing large responses, runs on the single main thread, and while it runs no new fetches can start. worker_threads (built into Node) spawns extra threads that handle parsing while the main thread keeps fetching.
In practice this matters for scrapes over 10,000 pages where parsing time approaches or exceeds fetch time. For the books.toscrape.com benchmark (50 pages and 1,000 books, around 50ms parse per page), worker_threads doesn’t change the total time.
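For the cases where parsing does dominate, a rough sketch of offloading it to a worker looks like this. Several things here are assumptions made to keep the example in one file: the worker source is inlined with eval: true (a real project would put it in its own file), a regex over h3 tags stands in for real Cheerio parsing, and createParser and workerSource are hypothetical names.

```javascript
import { Worker } from 'node:worker_threads';

// Worker code, inlined for the sketch. With eval: true it runs as a CommonJS script.
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', ({ id, html }) => {
    // Stand-in for CPU-heavy parsing: collect the text of every <h3>.
    const titles = [...html.matchAll(/<h3[^>]*>(.*?)<[/]h3>/gs)].map(m => m[1].trim());
    parentPort.postMessage({ id, titles });
  });
`;

// Wraps one worker behind a promise-based parse() call.
function createParser() {
  const worker = new Worker(workerSource, { eval: true });
  const pending = new Map(); // request id -> { resolve, reject }
  let nextId = 0;

  // Match each reply to the request that triggered it.
  worker.on('message', ({ id, titles }) => {
    pending.get(id)?.resolve(titles);
    pending.delete(id);
  });
  // If the worker crashes, fail everything still waiting.
  worker.on('error', err => {
    for (const { reject } of pending.values()) reject(err);
    pending.clear();
  });

  return {
    parse(html) {
      return new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        worker.postMessage({ id, html });
      });
    },
    close: () => worker.terminate(),
  };
}
```

The main thread keeps fetching and calls await parser.parse(html) for each page; the parsing itself runs on the worker's thread. Call parser.close() when the scrape finishes, otherwise the worker keeps the process alive.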
p-limit and worker threads scale the scraper itself. At some point the target becomes the limit (per-IP rate caps, geo-blocking, CAPTCHAs at high volume). HasData’s Web Scraping API handles the target-side scaling with rotating residential proxies across geographies and built-in retry on transient blocks.
JavaScript vs Python for web scraping
Stay on Node.js for scraping. Cheerio matches BeautifulSoup. Python’s Playwright is a Python binding around a Node process. asyncio.gather plus a semaphore is the verbose version of p-limit.
The one case for Python is downstream pandas/scikit-learn. Solve that with a JSON or CSV export from your Node scraper, not by porting everything to Python.
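A CSV export needs very little code. The sketch below assumes the scraper produces an array of flat objects; toCsv is a hypothetical helper and the sample rows are invented. It quotes fields the way pandas.read_csv expects by default.

```javascript
// Convert an array of flat objects to CSV text.
function toCsv(rows, columns = Object.keys(rows[0] ?? {})) {
  const escape = value => {
    const text = value == null ? '' : String(value);
    // Quote fields containing commas, quotes, or newlines; double any embedded quotes.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  // Header row first, then one line per record in the same column order.
  const lines = [columns, ...rows.map(row => columns.map(c => row[c]))];
  return lines.map(line => line.map(escape).join(',')).join('\n') + '\n';
}

// Invented sample data.
const books = [
  { title: 'A Light, in the Attic', price: 51.77 },
  { title: 'Say "hi"', price: 10 },
];
console.log(toCsv(books));
```

Write the result with writeFile from node:fs/promises and load it on the Python side with pandas.read_csv. For nested data, JSON.stringify and pandas.read_json are the simpler pair.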
Conclusion
The decision order on any new target.
- Run probe.js. CHEERIO WORKS means fetch + Cheerio is enough, ship it.
- YOU NEED A BROWSER means open DevTools Network and look for a JSON endpoint. If one is there, hit it with fetch directly. About 10x faster than rendering.
- If there is no endpoint, launch Playwright with waitForSelector. Wrap in try/catch for missing-selector pages.
- When request volume grows, add p-limit at concurrency 5-10. Going higher buys little speed and adds rate-limit risk.
- When IPs get blocked or Cloudflare shows up, add the stealth plugin, residential proxies, and retry-with-backoff.
That is the flow. And five packages on Node 22+ cover it.
FAQ
What’s the best library for web scraping in Node.js?
The right choice depends on whether the page is static or dynamic. For static HTML, fetch + Cheerio is the smallest stack that gets the job done. For JS-rendered pages, Playwright. Run probe.js against the target to know which path applies.
Should I use Cheerio or Puppeteer?
This is the static-versus-dynamic question, not a library preference. Cheerio parses HTML that fetch already downloaded. Puppeteer (or Playwright) is for pages that need a browser to render. Use probe.js to pick.
Axios vs Puppeteer?
These solve different problems. axios is an HTTP client like fetch, for static pages. Puppeteer (or Playwright) is for JS-rendered pages. They sit at different layers of the same scraper.
How do I avoid getting blocked?
Three layers, applied as needed. Send realistic Chrome headers including sec-ch-ua. Add the stealth plugin when the target inspects browser globals. Switch to residential proxies with retry on 403 / 429 when the IP itself gets banned.


