Web Scraping with Rust: A Complete Guide for Beginners

Valentina Skakun
Last update: 30 Apr 2024

Rust is a fast programming language, similar to C, that is suitable for building system software (drivers and operating systems) as well as regular programs and web applications. Choose Rust for your web scraper when you need tighter, lower-level control over the application: for instance, when you want to track resource usage, manage memory yourself, and so on.

In this article, we will explore the nuances of building an efficient web scraper with Rust, highlighting its pros and cons at the end. Whether you are tracking real-time data changes, conducting market research, or simply collecting data for analysis, Rust’s capabilities will allow you to build a web scraper that is both powerful and reliable.

Getting Started with Rust

To install Rust, go to the official website and download the installer (for Windows) or copy the install command (for Linux).

Install Rust

When you run the installer on Windows, a command prompt will open, and the installer will offer you a choice of three options:

Windows Install

As we don’t want to configure the dependencies manually, we select option 1 for automatic installation. When the installation finishes, you will see a message saying that Rust and all the necessary components have been installed successfully.

To install the default components on a Linux system, enter the following command in the terminal:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then, select option 1 during the installation process.

Linux Install

You can also update Rust in the terminal:

$ rustup update

Check the installed version:

$ rustc --version

And uninstall it:

$ rustup self uninstall

The installation and setup process is now finished. To write a Rust script, create a new file with the .rs extension. You can also use Cargo, Rust’s package manager, to create a new project. Use this command:

cargo new project_name

We use Visual Studio Code to write the code, along with the rust-analyzer extension to make things easier.

Tools and Libraries for Rust Web Scraping

Unlike Python, Rust has only a few scraping libraries; the most popular ones are:

  1. Reqwest and Scraper. Send a request to a web page and parse the result. Suitable for static pages only.

  2. Headless_Chrome. Lets you drive a headless browser and automate actions on the page (form filling, clicks, etc.). It offers functionality similar to Puppeteer for Node.js and Selenium for Python.

Let’s take a closer look at them.

Reqwest

Reqwest is a simple HTTP client library for Rust. It is asynchronous by default but also offers a blocking API, and it provides a convenient, efficient way to send HTTP requests to remote servers and process the responses.

To use it in your project, you need to install a dependency:

cargo add reqwest --features "reqwest/blocking"

After that, you can use it in your project and configure various aspects of requests, such as headers, parameters, and authorization.
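
For example, a blocking client with a custom User-Agent, an extra header, and query parameters might look roughly like this (a minimal sketch, not part of the original example; it assumes the blocking feature is enabled):

use reqwest::blocking::Client;

fn fetch_page() -> Result<String, reqwest::Error> {
    // Build a reusable client with a custom User-Agent
    let client = Client::builder()
        .user_agent("my-rust-scraper/0.1")
        .build()?;
    // Attach an extra header and query parameters to a single request
    let body = client
        .get("https://demo.opencart.com/")
        .header("Accept-Language", "en-US")
        .query(&[("route", "common/home")])
        .send()?
        .text()?;
    Ok(body)
}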

Scraper

This Rust library, unlike the previous one, can’t make requests. However, it is good at extracting data from HTML documents. Because of this, the two libraries are usually used together.

To add the dependency and use the library in a Rust project, use this command:

cargo add scraper

After that, you can parse the needed HTML document using CSS selectors.
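
Here is a minimal, self-contained sketch (illustrative only, not from the article) of how the scraper crate parses markup with a CSS selector:

use scraper::{Html, Selector};

fn main() {
    // A small HTML fragment to parse
    let html = r#"<ul><li class="item">First</li><li class="item">Second</li></ul>"#;
    let document = Html::parse_fragment(html);
    // CSS selector for the list items
    let selector = Selector::parse("li.item").unwrap();
    for element in document.select(&selector) {
        // Collect the text nodes of each matched element
        println!("{}", element.text().collect::<String>());
    }
}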

Headless_Chrome

The last library is “headless_chrome” for Rust. It drives the Chrome browser in headless mode for web scraping and automation. To use it, add the dependency with the following command:

cargo add headless_chrome

The “headless_chrome” Rust library lets you control the Chrome browser through the DevTools protocol. It gives a Rust interface for sending commands to the browser, like loading web pages, running JavaScript, simulating events, and more.
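
As a quick illustration (a sketch, not part of the tutorial code), opening a page and reading the text of an element looks like this; it assumes Chrome or Chromium is installed locally:

use headless_chrome::Browser;

fn main() {
    // Launch a headless Chrome instance and open a new tab
    let browser = Browser::default().unwrap();
    let tab = browser.new_tab().unwrap();
    // Navigate, wait for the page, then read the first <h1>
    tab.navigate_to("https://demo.opencart.com/").unwrap();
    tab.wait_until_navigated().unwrap();
    let heading = tab.wait_for_element("h1").unwrap().get_inner_text().unwrap();
    println!("Page heading: {}", heading);
}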

Basic Web Scraping in Rust

To make the libraries easier to understand, let’s look at a simple scraping example. We will scrape the demo OpenCart website.

Store Research

We’ve discussed the elements of this page before, so we won’t dwell on them again.

Making an HTTP request

We will work with the file automatically created by “cargo new project_name.” The main.rs file is located in the src subfolder of the project folder. It is generated with a sample function that prints “Hello, world!”.

We will write the code of our Rust scraper inside this automatically created function:

fn main() {
    // Here will be code
}

First, let’s start getting the website HTML. Use this command to send a request:

    let response = reqwest::blocking::get("https://demo.opencart.com/");

And then this one to extract the response body:

    let data = response.unwrap().text().unwrap();

If you want to test how this works, add a screen output command:

    println!("{data}");

To build and run the project, run the following at the command prompt:

cargo build
cargo run

The runtime will give you all the HTML code of the page.
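
If the site is unreachable, unwrap() will simply panic. A slightly more defensive variant of the same request (just a sketch) handles the errors explicitly:

fn main() {
    match reqwest::blocking::get("https://demo.opencart.com/") {
        Ok(response) => match response.text() {
            // Print the page HTML on success
            Ok(data) => println!("{data}"),
            Err(e) => eprintln!("Could not read the response body: {e}"),
        },
        Err(e) => eprintln!("Request failed: {e}"),
    }
}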

Parsing HTML Document

For parsing the data, we will use the Scraper library and its ability to extract data using CSS selectors. For this, we need a structure and a vector to store the data:

    struct DemoProduct {
        image: Option<String>,
        url: Option<String>,
        title: Option<String>,
        description: Option<String>,
        new: Option<String>,
        tax: Option<String>,
    }
    let mut demo_products: Vec<DemoProduct> = Vec::new();

Then parse the downloaded HTML and use the select method to extract information about all the products:

    let document = scraper::Html::parse_document(&data); // parse the HTML fetched earlier
    let html_product_selector = scraper::Selector::parse("div.col").unwrap();
    let html_products = document.select(&html_product_selector);

Finally, let’s go through each product and use CSS selectors to extract the data we need from the HTML elements and store it in the vector.

    for html_product in html_products {
        let image = html_product
            .select(&scraper::Selector::parse(".image a").unwrap())
            .next()
            .and_then(|a| a.value().attr("href"))
            .map(str::to_owned);
        let url = html_product
            .select(&scraper::Selector::parse("h4 a").unwrap())
            .next()
            .and_then(|a| a.value().attr("href"))
            .map(str::to_owned);
        let title = html_product
            .select(&scraper::Selector::parse(".description h4").unwrap())
            .next()
            .map(|h4| h4.text().collect::<String>());
        let description = html_product
            .select(&scraper::Selector::parse(".description p").unwrap())
            .next()
            .map(|p| p.text().collect::<String>());
        let new = html_product
            .select(&scraper::Selector::parse(".price-new").unwrap())
            .next()
            .map(|price| price.text().collect::<String>());
        let tax = html_product
            .select(&scraper::Selector::parse(".price-tax").unwrap())
            .next()
            .map(|price| price.text().collect::<String>());
        let demo_product = DemoProduct {
            image,
            url,
            title,
            description,
            new,
            tax,
        };
        demo_products.push(demo_product);
    }
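
One small design note: the loop above re-parses every CSS selector on each iteration. A slightly leaner variant of the same loop (a sketch) parses each selector once beforehand and reuses it:

    // Parse each selector a single time, outside the loop
    let image_selector = scraper::Selector::parse(".image a").unwrap();
    let title_selector = scraper::Selector::parse(".description h4").unwrap();
    // ...parse the remaining selectors once in the same way...
    for html_product in html_products {
        let image = html_product
            .select(&image_selector)
            .next()
            .and_then(|a| a.value().attr("href"))
            .map(str::to_owned);
        let title = html_product
            .select(&title_selector)
            .next()
            .map(|h4| h4.text().collect::<String>());
        // ...extract the other fields with their pre-parsed selectors...
    }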

To display this data on the screen, you need to go through all the elements again and print the entire vector line by line:

    for (index, product) in demo_products.iter().enumerate() {
        println!("Product #{}", index + 1);
        println!("Image: {:?}", product.image);
        println!("URL: {:?}", product.url);
        println!("Title: {:?}", product.title);
        println!("Description: {:?}", product.description);
        println!("New Price: {:?}", product.new);
        println!("Tax: {:?}", product.tax);
        println!("-----------------------------");
    }

But since we rarely need to display the data on the screen, let’s save the data we get to a CSV file.

Saving scraped data to a CSV file

There is a csv library for working with CSV files (which you can then open in Excel). To install it, we will use Cargo:

cargo add csv

Now let’s return to our script. Specify the path to save the file and the required columns.

    let mut csv_writer = csv::Writer::from_path("products.csv").unwrap();
    csv_writer.write_record(&["Image", "URL", "Title", "Description", "New Price", "Tax"]).unwrap();

Then process each element of the vector and write it to the file line by line:

    for product in demo_products {
        let image = product.image.unwrap();
        let url = product.url.unwrap();
        let title = product.title.unwrap();
        let description = product.description.unwrap();
        let new = product.new.unwrap();
        let tax = product.tax.unwrap();

        csv_writer.write_record(&[image, url, title, description, new, tax]).unwrap();
    }
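
Note that unwrap() will panic if any field is missing on the page (for example, a product without a .price-new element). A more forgiving version of the same loop (a sketch) falls back to an empty string instead:

    for product in demo_products {
        // unwrap_or_default() substitutes an empty String when a field is None,
        // so a missing element does not crash the whole run
        csv_writer.write_record(&[
            product.image.unwrap_or_default(),
            product.url.unwrap_or_default(),
            product.title.unwrap_or_default(),
            product.description.unwrap_or_default(),
            product.new.unwrap_or_default(),
            product.tax.unwrap_or_default(),
        ]).unwrap();
    }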

And finish working on the file:

    csv_writer.flush().unwrap();

You can use println! to report that the script has finished, or simply wait for it to complete. As a result, you will get a file with the following output:

CSV Result

Congratulations, you’ve just built a classic web scraper with Rust, taking full advantage of its memory safety and speed. You’ve learned how to make requests, parse HTML documents, and even write the scraped data to a CSV file — all within Rust’s ecosystem.

However, this method is suitable only for static web pages and does not let you work with dynamic content. It also carries a high risk of blocking, as websites can easily recognize your scraper. To solve these difficulties, you can use headless browsers.

Dealing with Dynamic Content

Let’s extend our example. We’ll use a library that drives a headless browser to navigate to a page and collect data. This solves some of these problems and also lets us get dynamic content off the page.

You will need an appropriate library, such as headless_chrome, to control the browser. The script will be similar to the first example, except that navigating to the page and collecting its HTML code will be handled by the headless_chrome library. Create a new Rust project with code similar to the first example, then add this to the main function:

    let browser = headless_chrome::Browser::default().unwrap();
    let tab = browser.new_tab().unwrap();
    tab.navigate_to("https://demo.opencart.com/").unwrap();
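
Depending on how quickly the page loads, it can also help to wait for navigation to finish before querying elements. This extra line is not in the original example, but headless_chrome provides it:

    tab.wait_until_navigated().unwrap();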

The processing of data using CSS selectors is also slightly different.

    let html_products = tab.wait_for_elements("div.col").unwrap();
    for html_product in html_products {
        let image = html_product
            .wait_for_element(".image a")
            .unwrap()
            .get_attributes()
            .unwrap()
            .unwrap()
            // get_attributes() returns a flat list of attribute name/value pairs,
            // so index 1 holds the value of the element's first attribute (href here)
            .get(1)
            .unwrap()
            .to_owned();
        let url = html_product
            .wait_for_element("h4 a")
            .unwrap()
            .get_attributes()
            .unwrap()
            .unwrap()
            .get(1)
            .unwrap()
            .to_owned();
        let title = html_product
            .wait_for_element(".description h4")
            .unwrap()
            .get_inner_text()
            .unwrap();
        let description = html_product
            .wait_for_element(".description p")
            .unwrap()
            .get_inner_text()
            .unwrap();
        let new = html_product
            .wait_for_element(".price-new")
            .unwrap()
            .get_inner_text()
            .unwrap();
        let tax = html_product
            .wait_for_element(".price-tax")
            .unwrap()
            .get_inner_text()
            .unwrap();

        let demo_product = DemoProduct {
            image: Some(image),
            url: Some(url),
            title: Some(title),
            description: Some(description),
            new: Some(new),
            tax: Some(tax),
        };
        demo_products.push(demo_product);
    }

Everything else remains the same as in the first example. This method, however, also has a few challenges. While this approach gives you granular control over the scraping process, it can be overwhelming for beginners or those who need quick and simplified solutions. That’s where specialized web scraping APIs like HasData come in.

These APIs offer the advantage of handling many complexities for you, such as rotating proxies, handling CAPTCHAs, and managing browser sessions, allowing you to focus more on the data you need rather than the intricacies of the scraping process.

If this sounds like an appealing alternative, let’s dive into how you can use these APIs for your web scraping needs.

Web Scraping in Rust Using API

Before we dive into the code, let’s briefly touch on what web scraping APIs are. Essentially, these are specialized services designed to simplify the process of web scraping by taking care of the complexities involved, such as handling CAPTCHAs, rotating proxies, and managing headless browsers.

For a more detailed overview, you may refer to our separate article on the subject.

Now, let’s see how we can integrate one such API into our Rust project. Since the API returns data in JSON format, you will need the serde library and the separate serde_json crate. You can add both using Cargo:

cargo add serde
cargo add serde_json

Then, you will need the API key, which you can find on the dashboard tab in your account after registering at HasData. Let’s create a new project and set the client object and request headers in the main function:

    let client = Client::builder().build()?;

    let mut headers = HeaderMap::new();
    headers.insert("x-api-key", HeaderValue::from_static("YOUR-API-KEY"));
    headers.insert("Content-Type", HeaderValue::from_static("application/json"));

Then, let’s use extraction rules and add them to the request body so that we get the required data in a single call:

    let mut extract_rules = HashMap::new();
    extract_rules.insert("Image", "div.image > a > img @src"); // Use space to identify src or href attribute
    extract_rules.insert("Title", "h4");
    extract_rules.insert("Link", "h4 > a @href");
    extract_rules.insert("Description", "p");
    extract_rules.insert("Old Price", "span.price-old");
    extract_rules.insert("New Price", "span.price-new");
    extract_rules.insert("Tax", "span.price-tax");

    let extract_rules_json: Value = serde_json::to_value(extract_rules)?;

Set the rest of the query parameters:

    let data = json!({
        "extract_rules": extract_rules_json,
        "url": "https://demo.opencart.com/"
    });

And make a POST request to the API:

    let request = client.post("https://api.hasdata.com/scrape")
        .headers(headers)
        .body(serde_json::to_string(&data)?);
    let response = request.send()?;

Now the script returns a response, and you can extract the data you need:

    let body = response.text()?;

Print it on the screen:

    println!("{}", body);

Or, reuse the earlier examples and save this data to a CSV file.

Web Crawling with Rust

The last code example in this article will be a simple crawler that recursively traverses a website’s pages and collects all the links. We have already written about the difference between a scraper and a crawler, so we will not compare them here.

This example uses the select crate for HTML parsing alongside reqwest with its blocking feature (add it with cargo add select). Let’s explicitly import the required modules in the file:

use reqwest::blocking::Client;
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;

Then, in the main function, we set the initial parameters and call the crawl function, which will traverse the links.

fn main() {
    let client = Client::new();
    let start_url = "https://demo.opencart.com/";
    let mut visited_links = HashSet::new();
    crawl(&client, start_url, &mut visited_links).unwrap();
}

Finally, there is the crawl function, where we check whether we have already visited the current link and, if not, crawl it.

fn crawl(client: &Client, url: &str, visited_links: &mut HashSet<String>) -> Result<(), reqwest::Error> {
    if visited_links.contains(url) {
        return Ok(());
    }
    visited_links.insert(url.to_string());

    let res = client.get(url).send()?;
    if !res.status().is_success() {
        return Ok(());
    }
    let body = res.text()?;
    let document = Document::from(body.as_str());
    for link in document.find(Name("a")) {
        if let Some(href) = link.attr("href") {
            if href.starts_with("http") && href.contains("demo.opencart.com") {
                println!("{}", link.text());
                println!("{}", href);
                println!("---");
                crawl(client, href, visited_links)?;
            }
        }
    }
    Ok(())
}

The result:

Crawler Result

This way, you can traverse all the site’s pages and output them to the console or, using the skills you’ve learned, save them to a CSV file.
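
One caveat: the recursion above has no depth limit, so on a large site it can issue a huge number of requests. A possible variant (a sketch; max_depth is a hypothetical parameter, not part of the original example) bounds the recursion:

fn crawl(client: &Client, url: &str, visited_links: &mut HashSet<String>, max_depth: usize) -> Result<(), reqwest::Error> {
    // Stop when the depth budget is exhausted or the link was already visited
    if max_depth == 0 || visited_links.contains(url) {
        return Ok(());
    }
    visited_links.insert(url.to_string());
    // ...same request and parsing code as above, but recurse with a smaller budget:
    // crawl(client, href, visited_links, max_depth - 1)?;
    Ok(())
}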

Pros and Cons of Using Rust for Web Scraping

Rust has several disadvantages as well as many advantages. Despite its learning curve and the limited resources on scraping in Rust specifically, it remains a strong choice thanks to its high performance and its ability to manage low-level processes.

Concurrency is another strength of Rust, enabling programmers to write concurrent programs that efficiently utilize system resources. Moreover, the Rust community provides vital support and collaboration opportunities. Developers can freely share their knowledge and learn from others through various forums, chat rooms, and online communities.

Nevertheless, there are a few drawbacks to consider when working with Rust. Regarding browser automation specifically, while you can rely on external tools or WebDriver bindings that are not native to Rust, direct support within the pure Rust ecosystem remains relatively limited compared to languages with mature, browser-backed tooling.

You can take a look at the table to review the main advantages and disadvantages of Rust:

Pros:

  1. Performance
  2. Memory Safety
  3. Concurrency
  4. Community Support
  5. Integration with Other Rust Libraries

Cons:

  1. Learning Curve
  2. Ecosystem Maturity
  3. Limited Browser Automation
  4. Documentation and Resources
  5. Less Tooling

Overall, these aspects show how Rust addresses essential pain points, such as safety concerns and performance, making it an increasingly attractive choice for many modern application domains.

Challenges of Web Scraping in Rust

Web scraping can be challenging, especially when using the Rust language. Let’s explore some of the difficulties web scrapers face in Rust and discuss possible solutions.

Limited Ecosystem and Library Availability

One of the challenges of web scraping with Rust is the limited availability of libraries and tools specifically tailored for scraping. Rust has a less extensive ecosystem for web scraping than other popular languages like Python or JavaScript. As a result, developers may need to spend more time building their own custom scraping utilities or working with existing but less comprehensive libraries.

Dynamic Content and JavaScript

Many modern websites use dynamic content generated through JavaScript execution on the client side. This dynamic content poses another hurdle for web scraping, because traditional HTML parsing alone is not enough to extract all the desired information accurately.

To overcome this limitation, one solution is to leverage a headless browser through a library such as headless_chrome, which drives Chrome over the DevTools protocol. This tool enables scripted interaction with websites as if you controlled an actual browser, allowing you to scrape dynamic content effectively.

CAPTCHA, Access Restrictions, and IP Blocking

We have already written about these difficulties and how to avoid them in another article, so here we will cover them only briefly. To protect their data and prevent scraping, websites employ measures like CAPTCHAs, access restrictions based on user-agent or IP address, or even temporary blocking of suspicious activity. Dealing with these obstacles can be tricky during web scraping with Rust.

There are a few strategies to mitigate these challenges, such as rotating IP addresses using proxy servers or employing libraries that help bypass CAPTCHA. By understanding and addressing these challenges head-on, developers can harness the power of Rust’s safety guarantees while building effective and efficient web scrapers.

Conclusion and Takeaways

While web scraping with Rust might present some challenges due to the language’s limited ecosystem for this specific use case, it is still highly feasible: explore the available libraries and apply appropriate techniques such as asynchronous programming or headless browsers. By understanding the hurdles involved and applying suitable solutions, developers can carry out web scraping tasks in Rust efficiently.
