Web Scraping with Node.js: How to Leverage the Power of JavaScript
Node.js is a JavaScript runtime environment built on top of the V8 engine developed by Google. Above all, though, Node.js is a platform for building web applications, and, like JavaScript itself, it is well suited to web-related tasks.
There are several web scraping libraries for Node.js: Axios, SuperAgent, Cheerio, and Puppeteer with headless browsers.
Advantages of using Node.js for Web Scraping
Our company uses a JavaScript + NodeJS + MongoDB stack in a Linux shell for web scraping. The connecting link is NodeJS, which has a number of undeniable advantages.
Firstly, NodeJS as a runtime is efficient due to its support for asynchronous I/O operations. This speeds up the application for HTTP requests and database requests in areas where the main thread of execution does not depend on the results of I/O.
Secondly, NodeJS supports streaming data transfer (Stream), which helps to process big files (or data) even with minimal system requirements.
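For example, here is a minimal sketch of stream-based processing; the file path, the countLines name, and the 'product' per-line check are illustrative assumptions rather than part of the original text:
const fs = require('fs');
const readline = require('readline');

// read a huge file line by line without loading it into memory
const countLines = async (filePath) => {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity,
  });
  let matches = 0;
  for await (const line of rl) {
    if (line.includes('product')) matches += 1; // any per-line processing goes here
  }
  return matches;
};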
Thirdly, NodeJS contains a lot of built-in modules that help to interact with the operating system and the web. For example, FileSystem and Path for data input/output procedures on the system disk, URL for manipulating route parameters and query parameters in URL, Process and Child processes - for managing operating system processes serving crawlers, and also Utils, Debugger, and so on.
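As a small illustration (the URL and the output file name are arbitrary examples), the built-in URL and Path modules cover typical routine tasks in a crawler:
const { URL } = require('url');
const path = require('path');

// manipulate query parameters of a target route
const target = new URL('https://example.com/catalog?page=2&sort=price');
target.searchParams.set('page', '3');
console.log(target.toString()); // https://example.com/catalog?page=3&sort=price

// build a platform-independent path for saving scraped data
const outputFile = path.join(__dirname, 'data', 'catalog-page-3.json');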
Fourth, the NodeJS ecosystem contains a huge number of packages from the developer community, which can help to solve almost any problem. For example, for scraping there are such libraries as Axios, SuperAgent, Cheerio, and Puppeteer. And if you want to scrape Google using these libraries, we suggest you read our article: “Web Scraping Google with Node JS”.
HTTP requests in NodeJS using Axios
Axios is a promise-based HTTP client.
Scraping needs little from Axios: the majority of requests are sent with the GET method.
Let's use:
const axios = require('axios');
const UserAgent = require('user-agents');

const instance = axios.create({
  headers: {
    'user-agent': (new UserAgent()).toString(),
    'cookie': process.env.COOKIE,
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
  },
  timeout: 10000,
  proxy: {
    host: process.env.PROXY_HOST,
    port: Number(process.env.PROXY_PORT), // axios expects the port as a number
    auth: {
      username: process.env.PROXY_USERNAME,
      password: process.env.PROXY_PASSWORD,
    },
  },
});
Here, an axios instance is created with an example configuration.
The user-agents library generates realistic values for the header of the same name, so they do not have to be entered manually.
Cookies are received using a separate script, the task of which is to launch a Chromium instance using the Puppeteer library, authenticate on the target website, and pass the cookie value cached by the browser to the application.
In the example, the cookie property is bound to the value of the environment variable process.env.COOKIE. This implies that the actual cookie has been placed in an environment variable beforehand, for example with the pm2 process manager. Alternatively, the cookie value (which is just a string) can be set directly in the configuration above by copying it from the browser developer tools.
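A possible sketch of such a script is shown below; the login URL, the authentication steps, and the getAuthCookie name are placeholders, and the exact flow depends on the target website:
// get-auth-cookie.js — hypothetical helper run before the scraper starts
const puppeteer = require('puppeteer');

const getAuthCookie = async (loginUrl) => {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(loginUrl, { waitUntil: 'networkidle2' });
    // ...authentication steps for the target website go here...
    const cookies = await page.cookies();
    // serialize into the string format expected by the 'cookie' header
    return cookies.map(({ name, value }) => `${name}=${value}`).join('; ');
  } finally {
    await browser.close();
  }
};

module.exports = getAuthCookie;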
It remains to send an HTTP request and read the content from the data property of the Response object. This is usually HTML, but it can also be any other format of the required data.
const yourAsyncFunc = async () => {
  const { data } = await instance.get(targetUrl); // data --> <!DOCTYPE html>... and so on
  // some code
};
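In practice it is worth wrapping such calls in error handling. Below is a rough sketch that retries on typical transient responses, using the instance configured above; fetchWithRetry, the status codes, and the delays are illustrative choices rather than part of Axios:
const fetchWithRetry = async (url, attempts = 3) => {
  for (let i = 0; i < attempts; i += 1) {
    try {
      const { data } = await instance.get(url);
      return data;
    } catch (err) {
      const status = err.response && err.response.status;
      // give up on the last attempt or on non-transient errors
      if (i === attempts - 1 || ![429, 500, 502, 503].includes(status)) throw err;
      await new Promise((resolve) => setTimeout(resolve, 2000 * (i + 1)));
    }
  }
};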
SuperAgent for parsing in NodeJS
As an alternative to Axios, there is the lightweight SuperAgent HTTP client. One can make a simple GET request like this:
const superagent = require('superagent');

const yourAsyncFunc = async () => {
  const { text } = await superagent.get(targetUrl); // text --> page content
  // some code
};
It has a good reputation for building web applications that use AJAX. A distinctive feature of SuperAgent is the chained (pipelined) way of setting the request configuration:
const superagent = require('superagent');

superagent
  .get('origin-url/route')
  .set('User-Agent', '<some UA>')
  .query({ city: 'London' })
  .end((err, res) => {
    // calling end() sends the request
    const { text } = res; // text --> page content
    // some other code
  });
One can add a proxy service to requests using the superagent-proxy wrapper. SuperAgent also supports the async/await syntax.
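A rough sketch of how that wrapper is usually wired in (the proxy URL is a placeholder; check the superagent-proxy documentation for the version you use):
const superagent = require('superagent');
require('superagent-proxy')(superagent); // adds a .proxy() method to requests

const fetchViaProxy = async (targetUrl) => {
  const { text } = await superagent
    .get(targetUrl)
    .proxy('http://username:password@proxy-host:8080') // placeholder proxy address
    .set('User-Agent', '<some UA>');
  return text;
};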
Structure Transformation with Cheerio
Cheerio is an indispensable tool for converting string HTML content into a tree structure, then traversing it and extracting the required data.
Scraping mostly uses the Load API and the Selectors API.
Loading:
const cheerio = require('cheerio');
// somewhere inside asynchronous func...
const { data } = await fetchPageContent(url);
const $ = cheerio.load(data); // now we've got cheerio node tree in '$'
The next steps are to select nodes by selectors and extract the required data.
Here is an example using cheerio, which filters all nodes matching a combination of selectors into a collection and extracts the links they contain:
const urls = $('.catalog li :nth-child(1) a')
  .filter((i, el) => $(el).attr('href'))
  .map((i, el) => $(el).attr('href'))
  .toArray();
// urls --> ['url1', 'url2', ..., 'urlN']
This chain tells cheerio to:
- find an element with the .catalog class in the markup (at any nesting level);
- select all elements with the li tag (at any nesting level within .catalog);
- in each of the li elements, select the first child element;
- in the first child element, select all elements with the a tag (at any nesting level);
- filter only those a elements that contain an href attribute;
- iterate over the resulting collection of a elements and extract the values of the href attribute;
- write the received links to a JS array.
This code contains only 4 short lines but has a high instruction density.
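The same approach works for collecting structured objects rather than bare links; a small sketch, assuming hypothetical .product, .title, and .price selectors:
const products = $('.product')
  .map((i, el) => ({
    title: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim(),
    url: $(el).find('a').attr('href'),
  }))
  .toArray();
// products --> [{ title, price, url }, ...]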
If cheerio does not return the required content fragments, it is worth checking that the HTML markup received from the web server really contains what is needed.
Headless Browsers with Puppeteer
Headless browsers are used to simulate user actions on a page programmatically, without launching a GUI instance.
Using a headless browser consumes system resources in one way or another and increases the overall running time of the application. In addition, care must be taken to ensure that processes with browser instances do not remain open in the system, as their uncontrolled growth will bring down the entire server.
Of the “headless” ones, the most used is Chromium, managed using the Puppeteer library, and the most common reason for its appearance in the scraper code is a pop-up captcha (or a requirement to execute some kind of JS code before loading content). The browser receives the task from the captcha, waits for a solution, and sends a response with the solution to the web server. Only after that, the calling code receives HTML for parsing.
Sometimes a headless browser is used to receive an authorization cookie and, in very rare cases, to load content by simulating mouse scrolling.
Using:
// puppeteer-service.js
const puppeteer = require('puppeteer');

module.exports = async (username, password) => {
  let browser;
  try {
    browser = await puppeteer.launch({
      args: ['--window-size=1920,1080'],
      headless: process.env.NODE_ENV === 'PROD', // FALSE means you can watch everything during development
    });
    const page = await browser.newPage();
    await page.setUserAgent('any-user-agent-value');
    await page.goto('any-target-url', { waitUntil: ['domcontentloaded'] });
    await page.waitForSelector('any-selector');
    await page.type('input.email-input', username);
    await page.type('input.password-input', password);
    await page.click('button[value="Log in"]');
    await page.waitForTimeout(3000);
    // extract the markup before the finally block closes the browser
    return await page.evaluate(
      () => document.querySelector('.pagination').innerHTML,
    );
  } catch (err) {
    // handling error
  } finally {
    if (browser) await browser.close();
  }
};
// somewhere in code
const cheerio = require('cheerio');
const fetchPagination = require('./puppeteer-service.js');

const $ = cheerio.load(await fetchPagination(username, password));
// some useful code next...
This code snippet provides a simplified example with basic methods for manipulating a Chromium page. First, a browser instance and a blank page are created. After the 'User-Agent' header is set, the page requests content at the given URL.
Next, the code makes sure that the required piece of content has loaded (waitForSelector), enters the login and password into the authorization fields (type), presses the 'Log in' button (click), waits for 3 seconds (waitForTimeout) while the authorized user's content loads, and finally returns the resulting HTML markup of the desired fragment with pagination to the calling code.
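When content is loaded by simulating scrolling, as mentioned above, a sketch along these lines is common; the number of steps and the pause are arbitrary values to tune for the target page:
// scroll the page several times, pausing so lazy-loaded content can appear
const autoScroll = async (page, steps = 5) => {
  for (let i = 0; i < steps; i += 1) {
    await page.evaluate(() => window.scrollBy(0, window.innerHeight));
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
};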
Using JavaScript's Async for a Speed Increase
The asynchronous I/O features that NodeJS and JavaScript support can be used to speed up the application. There are two conditions: free processor cores, on which one can run asynchronous processes separately from the main thread of execution, and independence of the main thread from the result of an asynchronous operation.
It is not possible to use the asynchrony of HTTP requests to speed up the running time of the process, since the continuation of the main thread directly depends on receiving a response to the HTTP request.
The same applies to operations for reading information from the database. Until the data is received, the main thread of the scraper has little to do.
But writing data to the database can be separated from the main thread. Suppose that, at some point in the code, an object is received that has to be written to the database. Depending on the result of the write, further actions branch, and only then does the next iteration start.
Instead of waiting for the result at the place where the write data function was called, one can create a new process and assign this work to it. The main thread can immediately move on.
// in some place of code...
// we've got the object for inserting into the database
const data = {
  product: 'Jacket',
  price: 55,
  currency: 'EUR',
};

const { fork } = require('child_process');

// launch the child process bound to the parent
const child = fork('path-to-write-data.js');
child.send(JSON.stringify(data));
// do the next code iteration...
Implementation of the code for the child process in a separate file:
// write-data.js
process.on('message', async (data) => {
  const result = await db.writeDataFunc(JSON.parse(data));
  if (result) {
    // do some important things
  } else {
    // do other important things
  }
});
The distinguishing feature of the fork method is that it lets the parent and child processes exchange messages with each other. In this example, however, the work is delegated to the child process without notifying the parent for demonstration purposes, which lets the parent continue its own thread of execution in parallel with the child.
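For completeness, here is a minimal sketch of that two-way exchange, reusing the write-data.js example above (the { ok } message shape is an illustrative choice):
// parent.js
const { fork } = require('child_process');

const child = fork('path-to-write-data.js');
child.on('message', (msg) => {
  // the child reports back when the write has finished
  console.log('write result:', msg);
});
child.send(JSON.stringify({ product: 'Jacket', price: 55, currency: 'EUR' }));

// write-data.js
process.on('message', async (data) => {
  const result = await db.writeDataFunc(JSON.parse(data));
  process.send({ ok: Boolean(result) }); // notify the parent instead of staying silent
});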
Avoid Blocks while Scraping
Most of the target websites from which data is scraped actively resist this process. Of course, the developers of these resources know how it works. This means that setting just the right headers is often not enough to crawl the entire site.
Web servers can limit the distribution of content after reaching a certain number of requests from one IP per unit of time. They can restrict access if they see that the request came from a data center proxy.
They can send a captcha to solve if the location of the client’s IP seems unreliable to them. Or they may offer to execute some JS code on the page before loading the main content.
The goal is to make the request look, by the web server's own metrics, as if it came from a user's browser and not from a bot. If the request looks realistic enough, the server returns the content rather than risk restricting a real user instead of a bot.
When such a picky web server is encountered, the problem is solved by correctly localizing proxy addresses and gradually increasing their quality, up to residential proxies. The downside of this approach is the increased cost of data collection.
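A simple sketch of that approach rotates requests through a list of proxies and pauses between them; the proxy addresses, the delays, and the fetchRotated name are placeholders to adapt to a real setup:
const axios = require('axios');

const proxies = [
  { host: 'proxy-1.example.com', port: 8080 },
  { host: 'proxy-2.example.com', port: 8080 },
]; // placeholder addresses; quality can be raised up to residential proxies

let requestCount = 0;

const fetchRotated = async (url) => {
  const proxy = proxies[requestCount % proxies.length];
  requestCount += 1;
  const { data } = await axios.get(url, { proxy });
  // random 1-3 second pause to stay under per-IP rate limits
  await new Promise((resolve) => setTimeout(resolve, 1000 + Math.random() * 2000));
  return data;
};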
You can also leverage other JavaScript libraries to enhance your scraping workflow depending on the project requirements.
Conclusion and Takeaways
NodeJS and JavaScript are well suited to every part of the scraping process. If a stack is needed, JavaScript, NodeJS, and MongoDB is one of the best choices. NodeJS not only covers practically every scraping task but also helps to ensure reliable data extraction, while headless browsers make it possible to imitate user behavior.