Web Scraping with Rust: A Complete Guide for Beginners
Rust is a fast, C-like systems programming language suited to building low-level software (drivers and operating systems) as well as regular applications and web services. Choose Rust for a web scraper when you need finer, lower-level control over your application: for instance, when you want to track resource usage, manage memory yourself, and so on.
In this article, we will explore the nuances of building an efficient web scraper with Rust, highlighting its pros and cons at the end. Whether you are tracking real-time data changes, conducting market research, or simply collecting data for analysis, Rust’s capabilities will allow you to build a web scraper that is both powerful and reliable.
Getting Started with Rust
To install Rust, go to the official website and download the installer (for Windows) or copy the install command (for Linux).
When you run the Windows installer, a command prompt will open and offer you a choice of three options:
As we don’t want to configure the dependencies manually, we select option 1 for the default installation. When the installation finishes, you will see a message saying that Rust and all the necessary components have been successfully installed.
To install the default components on a Linux system, enter in the terminal the command:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Then, select item 1 during the installation process.
You can also update Rust in the terminal:
$ rustup update
Check the installed version:
$ rustc --version
And uninstall it:
$ rustup self uninstall
The installation and setup process is now finished. To create a Rust script, make a new file with the .rs extension. You can also use Cargo, Rust’s package manager, to create a new project with this command:
cargo new project_name
As usual, we use Visual Studio Code to write the code. We will also use the Rust Analyzer plugin to make things easier.
Tools and Libraries for Rust Web Scraping
Unlike Python, Rust has only a few scraping libraries, but the most popular ones are:
Reqwest and Scraper. Send a request to a web page and parse the result. Suitable for static pages only.
Headless_Chrome. Allows you to use a headless browser and automate actions on the page (form filling, clicks, etc.). It has similar functionality to Puppeteer for NodeJS and Selenium for Python.
Let’s take a closer look at them.
Reqwest
Reqwest is a simple HTTP client library for Rust that supports both asynchronous and blocking requests. It provides a convenient and efficient way to send HTTP requests to remote servers and process HTTP responses.
To use it in your project, you need to install a dependency:
cargo add reqwest --features blocking
After that, you can use it in your project and configure various aspects of requests, such as headers, parameters, and authorization.
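For example, here is a minimal sketch (the URL and header values are placeholders) showing a blocking client with a custom User-Agent, a timeout, and an extra request header:
use std::time::Duration;

fn main() -> Result<(), reqwest::Error> {
    // Build a blocking client with a custom User-Agent and a request timeout
    let client = reqwest::blocking::Client::builder()
        .user_agent("my-rust-scraper/0.1")
        .timeout(Duration::from_secs(10))
        .build()?;

    // Send a GET request with an extra header and read the body as text
    let body = client
        .get("https://example.com/")
        .header("Accept-Language", "en-US")
        .send()?
        .text()?;

    println!("Fetched {} bytes", body.len());
    Ok(())
}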
Scraper
This Rust library, unlike the previous one, can’t make requests. However, it’s good at extracting data from HTML and XML documents, which is why the two libraries are usually used together.
To add the library to a Rust project, use this command:
cargo add scraper
After that, you can parse the needed HTML document using CSS selectors.
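For a quick illustration, here is a minimal sketch that parses an inline HTML snippet (made up for the example) with a CSS selector:
fn main() {
    // Some inline HTML to demonstrate parsing; in a real scraper this
    // would be the body returned by reqwest
    let html = r#"<ul><li class="item">First</li><li class="item">Second</li></ul>"#;

    // Parse the document and build a CSS selector
    let document = scraper::Html::parse_document(html);
    let selector = scraper::Selector::parse("li.item").unwrap();

    // Iterate over every element matching the selector and print its text
    for element in document.select(&selector) {
        let text: String = element.text().collect();
        println!("{}", text);
    }
}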
Headless_Chrome
The last library is headless_chrome. It drives the Chrome browser in headless mode for web scraping and automation. To use it, add the dependency with the following command:
cargo add headless_chrome
The headless_chrome Rust library lets you control the Chrome browser through the DevTools protocol. It provides a Rust interface for sending commands to the browser, such as loading web pages, running JavaScript, simulating events, and more.
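As a rough sketch of the basic flow (using a placeholder URL), launching the browser, opening a page, and reading the rendered HTML looks like this:
use headless_chrome::Browser;

fn main() {
    // Launch a headless Chrome instance with default options
    let browser = Browser::default().unwrap();

    // Open a new tab, navigate to the page and wait for it to load
    let tab = browser.new_tab().unwrap();
    tab.navigate_to("https://example.com/").unwrap();
    tab.wait_until_navigated().unwrap();

    // Grab the rendered HTML of the page
    let html = tab.get_content().unwrap();
    println!("Page length: {}", html.len());
}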
Basic Web Scraping in Rust
To make using the libraries easier, let’s look at a simple example of scraping with them. As an example, we will scrape the OpenCart demo website.
We’ve already discussed the elements on this page, so we won’t dwell on them again.
Making an HTTP request
We will work with the file automatically created by “cargo new project_name.” The main.rs file is located in the src subfolder of the project folder and is generated with a sample function that prints “Hello world!”.
We will write the code of our Rust scraper inside this automatically created function.
fn main() {
    // Our scraper code will go here
}
First, let’s get the website’s HTML. Use this line to send a request:
let response = reqwest::blocking::get("https://demo.opencart.com/");
Then use this one to extract the response body:
let data = response.unwrap().text().unwrap();
If you want to test how this works, add a screen output command:
println!("{data}");
To build and run the project, use the following commands at the command prompt:
cargo build
cargo run
Running the project will print all the HTML code of the page.
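These snippets use unwrap for brevity; in a real scraper you would usually handle the error cases instead of panicking. A minimal sketch of the same request with explicit error handling:
fn main() {
    // Handle request and body errors explicitly instead of calling unwrap
    match reqwest::blocking::get("https://demo.opencart.com/") {
        Ok(response) => match response.text() {
            Ok(data) => println!("{data}"),
            Err(e) => eprintln!("Failed to read the response body: {e}"),
        },
        Err(e) => eprintln!("Request failed: {e}"),
    }
}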
Parsing HTML Document
For parsing data, we will use the Scraper library and its ability to extract data using CSS selectors. For this, we need a structure and a vector to store the data:
struct DemoProduct {
    image: Option<String>,
    url: Option<String>,
    title: Option<String>,
    description: Option<String>,
    new: Option<String>,
    tax: Option<String>,
}

let mut demo_products: Vec<DemoProduct> = Vec::new();
Then parse the HTML we downloaded and use the select method to extract information about all the products:
let document = scraper::Html::parse_document(&data);
let html_product_selector = scraper::Selector::parse("div.col").unwrap();
let html_products = document.select(&html_product_selector);
Finally, let’s go through each product and use CSS selectors to store the data from the HTML elements we need in the vector.
for html_product in html_products {
    let image = html_product
        .select(&scraper::Selector::parse(".image a").unwrap())
        .next()
        .and_then(|a| a.value().attr("href"))
        .map(str::to_owned);
    let url = html_product
        .select(&scraper::Selector::parse("h4 a").unwrap())
        .next()
        .and_then(|a| a.value().attr("href"))
        .map(str::to_owned);
    let title = html_product
        .select(&scraper::Selector::parse(".description h4").unwrap())
        .next()
        .map(|h4| h4.text().collect::<String>());
    let description = html_product
        .select(&scraper::Selector::parse(".description p").unwrap())
        .next()
        .map(|p| p.text().collect::<String>());
    let new = html_product
        .select(&scraper::Selector::parse(".price-new").unwrap())
        .next()
        .map(|price| price.text().collect::<String>());
    let tax = html_product
        .select(&scraper::Selector::parse(".price-tax").unwrap())
        .next()
        .map(|price| price.text().collect::<String>());

    let demo_product = DemoProduct {
        image,
        url,
        title,
        description,
        new,
        tax,
    };
    demo_products.push(demo_product);
}
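The repeated select/next/map pattern can be factored into a small helper. Here is a sketch with a hypothetical extract_text function that returns the text of the first element matching a selector:
// Hypothetical helper: text of the first element matching `selector`, if any
fn extract_text(element: scraper::ElementRef, selector: &str) -> Option<String> {
    let sel = scraper::Selector::parse(selector).ok()?;
    element.select(&sel).next().map(|e| e.text().collect())
}
With it, each text field becomes a single call such as extract_text(html_product, ".description h4").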
To display this data on the screen, you must go through all the elements again and display the entire array line by line:
for (index, product) in demo_products.iter().enumerate() {
    println!("Product #{}", index + 1);
    println!("Image: {:?}", product.image);
    println!("URL: {:?}", product.url);
    println!("Title: {:?}", product.title);
    println!("Description: {:?}", product.description);
    println!("New Price: {:?}", product.new);
    println!("Tax: {:?}", product.tax);
    println!("-----------------------------");
}
But since we rarely need to display the data on the screen, let’s save the data we get to a CSV file.
Saving scraped data to a CSV file
There is a csv library for working with CSV files (which you can open in Excel). To install it, use Cargo:
cargo add csv
Now let’s return to our script. Specify the path to save the file and the required columns.
let mut csv_writer = csv::Writer::from_path("products.csv").unwrap();
csv_writer.write_record(&["Image", "URL", "Title", "Description", "New Price", "Tax"]).unwrap();
Then process each element of the array line by line and put it into a file:
for product in demo_products {
    // unwrap_or_default avoids a panic if a field is missing for some product
    let image = product.image.unwrap_or_default();
    let url = product.url.unwrap_or_default();
    let title = product.title.unwrap_or_default();
    let description = product.description.unwrap_or_default();
    let new = product.new.unwrap_or_default();
    let tax = product.tax.unwrap_or_default();
    csv_writer.write_record(&[image, url, title, description, new, tax]).unwrap();
}
And finish working on the file:
csv_writer.flush().unwrap();
You can use println! to print a message when the script finishes, or simply wait for it to complete. As a result, you will get a CSV file with the scraped data.
Congratulations, you’ve just built a classic web scraper with Rust, taking full advantage of its memory safety and speed. You’ve learned how to make requests, parse HTML documents, and even write the scraped data to a CSV file — all within Rust’s ecosystem.
However, this method is suitable only for static web pages and does not allow you to work with dynamic content. It also carries a high risk of blocking, as websites can easily recognize your scraper. To overcome these difficulties, you can use headless browsers.
Dealing with Dynamic Content
Let’s extend our example and use a library that drives a headless browser to navigate to a page and collect data. This solves some of the problems above and also lets us get dynamic content off the page.
You will need an appropriate library, such as headless_chrome, to control the browser. The script will be similar to the first example, except that navigating to the page and collecting its HTML will be handled by the headless_chrome library. Create a new Rust project with code similar to the first example, then add the following to the main function:
let browser = headless_chrome::Browser::default().unwrap();
let tab = browser.new_tab().unwrap();
tab.navigate_to("https://demo.opencart.com/").unwrap();
The processing of data using CSS selectors is also slightly different.
let html_products = tab.wait_for_elements("div.col").unwrap();
for html_product in html_products {
    // get_attributes() returns an interleaved list of attribute names and values,
    // so index 1 is the value of the element's first attribute (href here)
    let image = html_product
        .wait_for_element(".image a")
        .unwrap()
        .get_attributes()
        .unwrap()
        .unwrap()
        .get(1)
        .unwrap()
        .to_owned();
    let url = html_product
        .wait_for_element("h4 a")
        .unwrap()
        .get_attributes()
        .unwrap()
        .unwrap()
        .get(1)
        .unwrap()
        .to_owned();
    let title = html_product
        .wait_for_element(".description h4")
        .unwrap()
        .get_inner_text()
        .unwrap();
    let description = html_product
        .wait_for_element(".description p")
        .unwrap()
        .get_inner_text()
        .unwrap();
    let new = html_product
        .wait_for_element(".price-new")
        .unwrap()
        .get_inner_text()
        .unwrap();
    let tax = html_product
        .wait_for_element(".price-tax")
        .unwrap()
        .get_inner_text()
        .unwrap();

    let demo_product = DemoProduct {
        image: Some(image),
        url: Some(url),
        title: Some(title),
        description: Some(description),
        new: Some(new),
        tax: Some(tax),
    };
    demo_products.push(demo_product);
}
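One caveat: relying on get(1) only works while href happens to be the element's first attribute. A more robust sketch looks the attribute up by name, assuming the interleaved name/value format that get_attributes() returns:
// Look up one attribute by name in the flat name/value list returned by get_attributes()
fn attribute_value(attrs: &[String], name: &str) -> Option<String> {
    attrs
        .chunks(2)
        .find(|pair| pair.len() == 2 && pair[0] == name)
        .map(|pair| pair[1].clone())
}
You could then call attribute_value(&attrs, "href") on the vector returned by get_attributes().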
Everything else remains the same as in the first example. This method, however, also has a few challenges. While this approach gives you granular control over the scraping process, it can be overwhelming for beginners or those who need quick and simplified solutions. That’s where specialized web scraping APIs like HasData come in.
These APIs offer the advantage of handling many complexities for you, such as rotating proxies, handling CAPTCHAs, and managing browser sessions, allowing you to focus more on the data you need rather than the intricacies of the scraping process.
If this sounds like an appealing alternative, let’s dive into how you can use these APIs for your web scraping needs.
Web Scraping in Rust Using an API
Before we dive into the code, let’s briefly touch on what web scraping APIs are. Essentially, these are specialized services designed to simplify the process of web scraping by taking care of the complexities involved, such as handling CAPTCHAs, rotating proxies, and managing headless browsers.
For a more detailed overview, you may refer to our separate article on the subject.
Now, let’s see how we can integrate one such API into our Rust project. Since the API returns data in JSON format, we will need the serde and serde_json libraries. You can add them with Cargo:
cargo add serde
cargo add serde_json
Then, you will need an API key, which you can find on the dashboard tab of your account after registering at HasData. Let’s create a new project, import the required types, and set up the client object and request headers in the main function:
use std::collections::HashMap;

use reqwest::blocking::Client;
use reqwest::header::{HeaderMap, HeaderValue};
use serde_json::{json, Value};

// Note: using ? in main requires a return type such as Result<(), Box<dyn std::error::Error>>
let client = Client::builder().build()?;
let mut headers = HeaderMap::new();
headers.insert("x-api-key", HeaderValue::from_static("YOUR-API-KEY"));
headers.insert("Content-Type", HeaderValue::from_static("application/json"));
Then, let’s use the API’s extraction rules and add them to the request body to get the required data at once:
let mut extract_rules = HashMap::new();
extract_rules.insert("Image", "div.image > a > img @src"); // Use space to identify src or href attribute
extract_rules.insert("Title", "h4");
extract_rules.insert("Link", "h4 > a @href");
extract_rules.insert("Description", "p");
extract_rules.insert("Old Price", "span.price-old");
extract_rules.insert("New Price", "span.price-new");
extract_rules.insert("Tax", "span.price-tax");
let extract_rules_json: Value = serde_json::to_value(extract_rules)?;
Set the rest of the query parameters:
let data = json!({
    "extract_rules": extract_rules_json,
    "url": "https://demo.opencart.com/"
});
And make a POST request to the API:
let request = client.post("https://api.hasdata.com/scrape")
    .headers(headers)
    .body(serde_json::to_string(&data)?);
let response = request.send()?;
Now the script returns a response, and you can get the data you need:
let body = response.text()?;
Print it on the screen:
println!("{}", body);
Or, following the previous examples, save this data to a CSV file.
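If you prefer to work with the response as structured data rather than a raw string, you can also parse the body into a generic serde_json::Value first (the exact field layout depends on the API response):
// Parse the raw body into a generic JSON value and pretty-print it
let json: Value = serde_json::from_str(&body)?;
println!("{}", serde_json::to_string_pretty(&json)?);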
Web Crawling with Rust
The last code example in this article will be a simple crawler that recursively traverses a website’s pages and collects all the links. We have already written about the difference between a scraper and a crawler, so we will not compare them here.
This example uses the select crate for HTML parsing (add it with cargo add select) along with the blocking reqwest client. Let’s explicitly import the required modules in the file:
use reqwest::blocking::Client;
use select::document::Document;
use select::predicate::Name;
use std::collections::HashSet;
Then, in the main function, we set the initial parameters and call the crawl function, which will traverse the links.
fn main() {
    let client = Client::new();
    let start_url = "https://demo.opencart.com/";
    let mut visited_links = HashSet::new();

    crawl(&client, start_url, &mut visited_links).unwrap();
}
Finally, there is the crawl function, where we check whether we have already visited the current link and, if not, crawl it.
fn crawl(client: &Client, url: &str, visited_links: &mut HashSet<String>) -> Result<(), reqwest::Error> {
    if visited_links.contains(url) {
        return Ok(());
    }
    visited_links.insert(url.to_string());

    let res = client.get(url).send()?;
    if !res.status().is_success() {
        return Ok(());
    }

    let body = res.text()?;
    let document = Document::from(body.as_str());

    for link in document.find(Name("a")) {
        if let Some(href) = link.attr("href") {
            if href.starts_with("http") && href.contains("demo.opencart.com") {
                println!("{}", link.text());
                println!("{}", href);
                println!("---");

                crawl(client, href, visited_links)?;
            }
        }
    }

    Ok(())
}
The result is a list of link titles and URLs printed to the console. This way, you can traverse all the site’s pages and either output them to the console or, using the skills you’ve learned, save them to a CSV file.
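In practice you may want to bound the traversal rather than follow every internal link. Here is a hypothetical variant of the crawl function with a depth limit (it reuses the same imports as above):
// Hypothetical variant of crawl() with a depth limit, so the crawler stops
// after a fixed number of link levels instead of visiting the whole site
fn crawl_limited(
    client: &Client,
    url: &str,
    visited_links: &mut HashSet<String>,
    depth: usize,
) -> Result<(), reqwest::Error> {
    if depth == 0 || visited_links.contains(url) {
        return Ok(());
    }
    visited_links.insert(url.to_string());

    let res = client.get(url).send()?;
    if !res.status().is_success() {
        return Ok(());
    }

    let body = res.text()?;
    let document = Document::from(body.as_str());
    for link in document.find(Name("a")) {
        if let Some(href) = link.attr("href") {
            if href.starts_with("http") && href.contains("demo.opencart.com") {
                println!("{}", href);
                // Recurse with one less level of depth remaining
                crawl_limited(client, href, visited_links, depth - 1)?;
            }
        }
    }
    Ok(())
}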
Pros and Cons of Using Rust for Web Scraping
Rust has several disadvantages as well as many advantages. Despite the steep learning curve and the limited resources on scraping in Rust, it remains a viable choice thanks to its high performance and its ability to manage low-level processes.
Concurrency is another strength of Rust, enabling programmers to write concurrent programs that efficiently utilize system resources. Moreover, the Rust community provides vital support and collaboration opportunities. Developers can freely share their knowledge and learn from others through various forums, chat rooms, and online communities.
Nevertheless, there are a few drawbacks to consider when working with Rust. Regarding browser automation specifically, while you can use external tools or non-native crates such as Selenium WebDriver bindings, direct support within the pure Rust ecosystem remains relatively limited compared to languages with first-class browser tooling.
You can take a look at the table to review the main advantages and disadvantages of Rust:
| Pros | Cons |
|---|---|
| 1. Performance | 1. Learning Curve |
| 2. Memory Safety | 2. Ecosystem Maturity |
| 3. Concurrency | 3. Limited Browser Automation |
| 4. Community Support | 4. Documentation and Resources |
| 5. Integration with Other Rust Libraries | 5. Less Tooling |
Overall, these aspects demonstrate how Rust addresses essential pain points such as safety and performance, making it an increasingly attractive choice for many modern application domains.
Challenges of Web Scraping in Rust
Web scraping can be challenging, especially when using the Rust language. Let’s explore some of the difficulties web scrapers face in Rust and discuss possible solutions.
Limited Ecosystem and Library Availability
One of the challenges of web scraping with Rust is the limited availability of libraries and tools specifically tailored for scraping. Rust has a less extensive ecosystem for web scraping than other popular languages like Python or JavaScript. As a result, developers may need to spend more time building their own custom scraping utilities or working with existing but less comprehensive libraries.
Dynamic Content and JavaScript
Many modern websites use dynamic content generated through JavaScript execution on the client side. This dynamic content poses another hurdle in web scraping because traditional HTML parsing alone is not enough to extract all the desired information accurately.
To overcome this limitation, one solution is to leverage a headless browser library such as headless_chrome, which drives Chrome through the DevTools protocol. This enables scripted interaction with websites as if you controlled an actual browser, allowing you to scrape dynamic content effectively.
CAPTCHA, Access Restrictions, and IP Blocking
We have already written about these difficulties and how to avoid them in another article, so here we will only touch on them briefly. To protect their data and prevent scraping, websites employ measures like CAPTCHAs, access restrictions based on user-agent or IP address, or even temporary blocking of suspicious activity. Dealing with these obstacles can be tricky during web scraping with Rust.
There are a few strategies to mitigate these challenges, such as rotating IP addresses using proxy servers or employing libraries that help bypass CAPTCHA. By understanding and addressing these challenges head-on, developers can harness the power of Rust’s safety guarantees while building effective and efficient web scrapers.
Conclusion and Takeaways
While web scraping with Rust might present some challenges due to the language’s limited ecosystem for this specific use case, it is still highly feasible: explore the available libraries and apply appropriate techniques such as asynchronous programming or headless browsers. By understanding the hurdles involved and applying suitable solutions, developers can carry out web scraping tasks in Rust efficiently.