Web Scraping with Ruby
Web scraping is the process of automatically extracting information from web pages. It’s a powerful technique that allows developers to quickly and easily gather data from websites without having to manually enter or download it. Web scraping can be used for many different purposes, such as collecting product prices, gathering contact information, or analyzing trends on social media sites.
One of the most popular programming languages for web scraping is Ruby, because it is open source, flexible, and easy to use. We’ve already discussed web scraping in Python, C#, NodeJS and R, but in this article let’s look at Ruby. With Ruby you can write complex scripts that automate the entire data collection process, from requesting a website page to parsing out the relevant pieces of information (like email addresses). Ruby also has a wide range of libraries designed specifically for web scraping. You can find plenty of them to choose from on GitHub, but in this article we will focus on the most widely used and well-known ones.
Preparing for web scraping with Ruby
Before creating a Ruby-based web scraper, we need to prepare the environment and install the necessary libraries. First, let’s set up the environment and install Ruby, and then look at the libraries and install them.
Installing the environment
The official Ruby website provides installation instructions for all popular platforms, whether it is Debian, CentOS, Snap, macOS, OpenBSD, Windows or others. Note that there is also an installer for Windows that bundles Ruby and the base packages. This option suits those who want to simplify their Ruby installation. If you decide to use the installer, don’t forget to check the boxes in the necessary places during installation:
This is necessary so that the system knows where the Ruby executable is and associates all files with the *.rb and *.rbw extensions with Ruby.
After you have installed Ruby, you can check if everything went well by running the command in the command line:
ruby -v
This should return a line with the ruby version:
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x64-mingw-ucrt]
Now, decide where you are going to write the code. Strictly speaking, this is not very important: you can run a script file from the command line and write the code even in Notepad. However, it is better to use an editor that highlights syntax and points out errors, such as Sublime Text or Visual Studio Code.
Installing the libraries
When this part is done, you can start installing the libraries. In this tutorial, we will look at the following libraries:
HTTParty. A full-featured HTTP client that lets you perform GET, POST, PUT and DELETE requests. While it’s not specifically designed for web scraping, it is useful for fetching data from web pages and APIs.
Net::HTTP. Another library that allows you to execute and process HTTP requests.
Nokogiri. A complete library for parsing and processing XML and HTML documents. It can’t make requests itself, but it is great for processing the fetched data, and it supports both CSS selectors and XPath.
Mechanize. One of the most popular libraries for Ruby web scraping. Unlike Nokogiri, it can fetch the data itself as well as parse it.
Watir. A web application testing framework that can also be used for web scraping. It automates interactions with web pages in a way similar to Mechanize, but it drives a real browser that can also run headless.
Besides those mentioned above, there are many other Ruby gems for web scraping, such as PhantomJS, Capybara or Kimurai. However, the libraries suggested above are more popular and better documented, so in this article we will focus on them.
To install packages in Ruby, use the gem install command:
gem install httparty
gem install nokogiri
gem install mechanize
gem install watir
There is no need to install Net::HTTP, because it ships with Ruby’s standard library.
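If you want to double-check that a particular library loads, a quick one-liner like this (purely illustrative) is enough:
ruby -e "require 'net/http'; puts 'Net::HTTP is available'"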
Page Analysis
As an example, let’s use a test page containing books that we can scrape. First, let’s go to the page and look at its HTML code. To open the HTML page code, go to DevTools (press F12 or right-click on an empty space on the page and go to Inspect).
All products on the page are placed inside an <ol> tag with the “row” class, and each individual product sits in its own <li> sub-tag. Let’s take a closer look at one of the products:
Based on the HTML code, we can get the following data:
Image link. Located in the <a> tag and is the content of the “href” attribute.
Rating. The <p> tag has two classes: star-rating and a second class holding the rating value itself. In our example, the rating is Three.
Title. The title of the book is in the <a> tag. However, it is not fully specified in the tag. The full title is in the “title” attribute.
Price. Here we just need to get the content of the <div> tag with the class “product_price”.
Availability. Here it is necessary to pay attention to the class. For those books that are available, the class “icon-ok” is used.
Now that you have all the necessary components in place, it’s time to start building your scraper. With Ruby and the right tools, scraping data from websites is relatively easy.
Creating a web scraper
Create a new *.rb file, for example “scraper.rb”, and open it. In this file, we will write code to scrape data using Ruby. To start, we’ll take a look at each of the installed libraries one by one to see how they can help us fetch information from websites or other sources.
Make HTTP requests with HTTParty
The first library on our list is HTTParty. It doesn’t process or parse data, but it makes executing requests easy. Let’s connect it:
require "httparty"
Let’s run the query and get the code of the books.toscrape.com page:
response = HTTParty.get("https://books.toscrape.com/catalogue/page-1.html")
To display the result, you can use the puts() command:
puts(response)
As a result, we will get the HTML code of the page:
D:\scripts\ruby>ruby scraper.rb
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
<head>
<title>
All products | Books to Scrape - Sandbox
...
</title>
</head>
</html>
Unfortunately, many websites try to block scrapers from accessing their data. There are a few methods you can use to reduce the chance of being blocked: using proxies and web scraping APIs, setting up random delays between requests, and using headless browsers. It’s also important to specify a User-Agent in each request you send, which makes it more likely that your scraper will avoid being blocked.
response = HTTParty.get("https://books.toscrape.com/catalogue/page-1.html", {
headers: {
"User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
},
})
Remember that it is better to use a real User-Agent. You can find a list of the latest User-Agents in our other article.
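HTTParty can also route requests through a proxy, and a short random pause between calls helps as well. A minimal sketch (the proxy host and port below are placeholders):
response = HTTParty.get(
  "https://books.toscrape.com/catalogue/page-1.html",
  headers: { "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36" },
  http_proxyaddr: "proxy.example.com", # placeholder proxy host
  http_proxyport: 8080                 # placeholder proxy port
)
sleep(rand(1..3)) # random delay before the next request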
Scrape Data with HTTParty and Web Scraping API
As we said earlier, the HTTParty library is only designed for sending out requests. However, we can use the web scraping API to process the data. To do this we need an API key, which you can get after signing up for HasData. We will also need an additional library:
gem install json
Connect this library in our file:
require "json"
Now, let’s make a request to the API. We need to provide some information that tells it what we’re looking for. This includes specifying the request headers (like which type of data we want and your API key) and the body (the actual content of our request):
url = "https://api.hasdata.com/scrape"
headers = {
"x-api-key" => "YOUR-API-KEY",
"Content-Type" => "application/json"
}
payload = {
extract_rules: {
title: "h3>a @title",
price: "div.product_price>p.price_color",
image: "div.image_container>a @href",
rating: "p.star-rating @class",
available: "p.availability>i @class"
},
wait: 0,
screenshot: true,
block_resources: false,
url: "https://books.toscrape.com/catalogue/page-1.html"
}.to_json
Then let’s execute the query:
response = HTTParty.post(url, headers: headers, body: payload)
parsed_page = JSON.parse(response.body)
Now we can refer to the attributes of the JSON response returned to us by HasData API, which contains the necessary data:
extracted_data = parsed_page["scrapingResult"]["extractedData"]
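Before saving anything, it can be useful to print a couple of values to confirm that the response has the expected shape (each extract rule is assumed to return an array of values, which the CSV step below also relies on):
puts extracted_data["title"]&.first(3) # print the first few titles as a sanity check
puts extracted_data["price"]&.first(3) # and the first few prices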
Now that the data is in, let’s save it to a CSV file.
Save Data to CSV
Let’s get started by installing the CSV library:
gem install csv
Connect the library to a file:
require "csv"
To save the data to a file, we need to open it with the right settings. We’ll use “w” - if the file doesn’t already exist, this will create one for us. If it does exist, then any existing content will be overwritten. Plus, we can define what character separates each piece of information – we will use a semi-colon (;):
CSV.open("data.csv", "w", col_sep: ";") do |csv|
…
end
Let’s set the column titles:
csv << ["Title", "Price", "Image", "Rating", "Availability"]
Finally, let’s go through all the arrays we got line by line and put the data into a file:
[extracted_data["title"], extracted_data["price"], extracted_data["image"], extracted_data["rating"], extracted_data["available"]].transpose.each do |row|
csv << row
end
As a result, we got the following table:
It looks great, but we can make it even better by cleaning up the data. Fortunately, this is easy to do with Ruby’s built-in methods. We can start by removing the pound symbol (£) from the prices, since it can show up as garbled characters when saved to a file:
extracted_data["price"] = extracted_data["price"].map { |price| price.gsub("£", "") }
We have two different classes in the “rating” column. We only need the second one, so let’s solve this problem by using the split() method. This will take our string and separate it into an array of strings based on a specified pattern or character:
extracted_data["rating"] = extracted_data["rating"].map { |rating| rating.split(" ").last }
Now let’s run the query again and look at the table:
Now the result looks better and will be easier to work with in the future.
Resulting code:
require "httparty"
require "json"
require "csv"
url = "https://api.hasdata.com/scrape"
headers = {
"x-api-key" => "YOUR-API-KEY",
"Content-Type" => "application/json"
}
payload = {
extract_rules: {
title: "h3>a @title",
price: "div.product_price>p.price_color",
image: "div.image_container>a @href",
rating: "p.star-rating @class",
available: "p.availability>i @class"
},
wait: 0,
screenshot: true,
block_resources: false,
url: "https://books.toscrape.com/catalogue/page-1.html"
}.to_json
response = HTTParty.post(url, headers: headers, body: payload)
parsed_response = JSON.parse(response.body)
extracted_data = parsed_response["scrapingResult"]["extractedData"]
# Remove pound symbol (£) from price values
extracted_data["price"] = extracted_data["price"].map { |price| price.gsub("£", "") }
# Extract only the second word from the rating values
extracted_data["rating"] = extracted_data["rating"].map { |rating| rating.split(" ").last }
CSV.open("data.csv", "w", col_sep: ";") do |csv|
csv << ["Title", "Price", "Image", "Rating", "Availability"]
[extracted_data["title"], extracted_data["price"], extracted_data["image"], extracted_data["rating"], extracted_data["available"]].transpose.each do |row|
csv << row
end
end
puts "Data saved to data.csv"
Now that we’ve gone over the HTTParty library, let’s take a look at how it would appear if we used Net::HTTP instead. We’ll be able to gain an understanding of what makes these libraries unique and figure out which is best for our scraping needs.
Make requests with Net::HTTP
The Net::HTTP library, which is provided by the net-http gem, can be used in combination with the open-uri gem for added functionality. Net::HTTP provides access to the underlying HTTP protocol, while open-uri makes it simpler and more efficient to request data from a remote server. Together they provide an effective way of getting information from websites quickly.
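As a quick illustration of the open-uri side of that pairing (shown only for comparison; the rest of this section uses URI and Net::HTTP directly), fetching a page can be as short as:
require 'open-uri'

html = URI.open('https://books.toscrape.com/catalogue/page-1.html').read # open-uri returns an IO-like object
puts html[0, 200] # print the first 200 characters of the HTML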
First, let’s connect the libraries:
require "uri"
require "net/http"
Then we need to parse the URL string into a URI (Uniform Resource Identifier) object:
url = URI.parse("https://books.toscrape.com/catalogue/page-1.html")
This breaks the URL down into its constituent parts (scheme, host, path and so on), so they can be accessed and manipulated individually. Finally, we can execute the request and print the result to the command line:
request = Net::HTTP.get_response(url)
puts request.body
This is how you make a simple GET request with Net::HTTP to fetch data from the web. It’s a quick way to access and extract data from a website.
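If you need the same User-Agent trick with Net::HTTP, you can inspect the parsed URI and attach headers to a request object. A minimal sketch:
puts url.host # => "books.toscrape.com"
puts url.path # => "/catalogue/page-1.html"

request = Net::HTTP::Get.new(url)
request["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
response = Net::HTTP.start(url.host, url.port, use_ssl: true) { |http| http.request(request) }
puts response.code # HTTP status code, e.g. "200"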
Scrape Data with Net::HTTP and Web Scraping API
We’re going to use the HasData API for this task. First, we need to define the API endpoint to send our request to, along with the headers and body of the POST request. In essence, we give the API the details of what we want to scrape: the URL of the page and the content we want to extract:
url = URI("https://api.hasdata.com/scrape")
https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true
request = Net::HTTP::Post.new(url)
request["x-api-key"] = "YOUR-API-KEY"
request["Content-Type"] = "application/json"
request.body = JSON.dump({
"extract_rules": {
"title": "h3>a @title",
"price": "div.product_price>p.price_color",
"image": "div.image_container>a @href",
"rating": "p.star-rating @class",
"available": "p.availability>i @class"
},
"wait": 0,
"screenshot": true,
"block_resources": false,
"url": "https://books.toscrape.com/catalogue/page-1.html"
})
Then run the query and display the data on the screen:
response = https.request(request)
puts response.body
As we can see, using this library isn’t much different from the one we looked at before. So, let’s reuse the code from the previous example to save the data in CSV format. To avoid repetition, here is the full version of the finished script:
require "uri"
require "json"
require "net/http"
require "csv"
url = URI("https://api.hasdata.com/scrape")
https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true
request = Net::HTTP::Post.new(url)
request["x-api-key"] = "YOUR-API-KEY"
request["Content-Type"] = "application/json"
request.body = JSON.dump({
"extract_rules": {
"title": "h3>a @title",
"price": "div.product_price>p.price_color",
"image": "div.image_container>a @href",
"rating": "p.star-rating @class",
"available": "p.availability>i @class"
},
"wait": 0,
"screenshot": true,
"block_resources": false,
"url": "https://books.toscrape.com/catalogue/page-1.html"
})
response = https.request(request)
parsed_response = JSON.parse(response.body)
extracted_data = parsed_response["scrapingResult"]["extractedData"]
extracted_data["price"] = extracted_data["price"].map { |price| price.gsub("£", "") }
extracted_data["rating"] = extracted_data["rating"].map { |rating| rating.split(" ").last }
CSV.open("data.csv", "w", col_sep: ";") do |csv|
csv << ["Title", "Price", "Image", "Rating", "Availability"]
[extracted_data["title"], extracted_data["price"], extracted_data["image"], extracted_data["rating"], extracted_data["available"]].transpose.each do |row|
csv << row
end
end
puts "Data saved to data.csv"
Now that we’ve taken a look at the query libraries, let’s move on to exploring the libraries that provide data processing capabilities. These libraries allow us to transform and manipulate the data we scrape in meaningful ways.
Parse the data with Nokogiri
Nokogiri is an incredibly useful library for both processing and parsing data. It’s easy to use and very popular among Ruby developers. It provides an effective way to work with HTML and XML documents, making the process of scraping data fast and easy.
Unfortunately, Nokogiri is not a standalone scraping tool: it cannot send requests on its own. It also works well for static pages only; you cannot use it to fetch and render data from dynamic pages. In the previous examples, we solved these issues by using a web scraping API; with Nokogiri alone, these are problems the developer has to solve themselves.
However, if you only need to scrape static pages, don’t need to perform any actions on the page (such as logging in), and want a simple library, then Nokogiri is exactly what you need.
To use it, we need a request library. Here we’ll use HTTParty, but any library that suits your needs will do. To begin, let’s connect the two libraries and fetch the full code of our page:
require 'httparty'
require 'nokogiri'
headers = {
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = "https://books.toscrape.com/catalogue/page-1.html"
response = HTTParty.get(url, headers: headers)
Now pass the code you got to the Nokogiri object. It will then parse through the page structure and make it easier for you to access different elements on that page:
doc = Nokogiri::HTML(response.body)
Now we can get the necessary data from the page using CSS selectors:
titles = doc.css('h3 > a').map { |element| element['title'] }
prices = doc.css('div.product_price > p.price_color').map { |element| element.text.gsub("£", "") }
images = doc.css('div.image_container > a').map { |element| element['href'] }
ratings = doc.css('p.star-rating').map { |element| element['class'].split(' ').last }
availabilities = doc.css('p.availability > i').map { |element| element['class'].split('-').last }
Here we refined the data using techniques that address the problems we ran into earlier. First, the block { |element| element['attribute'] } (where 'attribute' is the name of the attribute whose content you want) lets us read an element’s attribute value instead of its text.
We then used gsub("£", "") to clean up each element of the price array; the pound sign has to go because it is displayed incorrectly when the data is saved to a file. Finally, we used split(' ') on the rating class and split('-') on the availability class, taking .last in each case so that only the meaningful part remains.
As we can see, using Nokogiri is quite straightforward and won’t pose any difficulties even for beginners.
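Since Nokogiri also understands XPath, the same titles could be collected with an XPath expression instead of a CSS selector; which one to use is mostly a matter of preference:
titles = doc.xpath('//h3/a').map { |element| element['title'] } # same result as the CSS selector above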
Scraping Multiple Pages
Now that we know how to extract data from a single page, let’s use it to collect information from every page on the website. After all, creating scrapers isn’t usually just for one page; they’re often used to scrape an entire website or online store.
Unfortunately, with Nokogiri we cannot move to the next page by clicking a button. However, we can find the pattern the site uses for its links and predict what the following pages will look like.
To do this, take a close look at the page links:
https://books.toscrape.com/catalogue/page-1.html
https://books.toscrape.com/catalogue/page-2.html
...
https://books.toscrape.com/catalogue/page-50.html
As we can see, only the page number changes on this example site. In total, there are 50 pages. Let’s put the unchanging starting part into a base_url variable:
base_url = 'https://books.toscrape.com/catalogue/page-'
We will also use the same User-Agent for all requests:
headers = {
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
Finally, we need to set the number of pages and the variable in which data from all pages will be stored:
total_pages = 50
data = []
Actually, we could take our scraper a step further and avoid specifying the number of pages manually. Instead, we could read the page count from the page itself, or keep looping until a request returns a “404 Not Found” error. In this case, though, such complexity isn’t required, as the links have a clear, predictable structure.
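For reference, a minimal sketch of that open-ended approach could look like the loop below, which keeps requesting pages until the site stops returning a successful response:
page = 1
loop do
  response = HTTParty.get("#{base_url}#{page}.html", headers: headers)
  break unless response.code == 200 # stop when the site runs out of pages
  # ... parse the page here ...
  page += 1
end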
Now, let’s create a loop that goes through each page and builds the appropriate link. The loop counter will match the page number: on the first iteration we construct the link for the first page, on the second iteration the link for the second page, and so on:
(1..total_pages).each do |page|
url = "#{base_url}#{page}.html"
response = HTTParty.get(url, headers: headers)
doc = Nokogiri::HTML(response.body)
end
The CSS selectors and the variables we extract stay the same, so let’s move on to putting the data into a hash:
page_data = titles.zip(prices, images, ratings, availabilities).map do |title, price, image, rating, availability|
{ title: title, price: price, image: image, rating: rating, availability: availability }
end
Finally, we append the page’s data to the data array we created at the beginning:
data.concat(page_data)
It is also recommended to add a random delay between requests to reduce the chance of being blocked:
sleep(rand(1..3))
After the loop finishes, save the data to a CSV file:
CSV.open('book_data.csv', 'w', col_sep: ";") do |csv|
csv << data.first.keys # Write the headers
data.each { |hash| csv << hash.values } # Write the data rows
end
As a result, our script will go through all the pages, collect the data and save it to a CSV file:
Full code:
require 'httparty'
require 'nokogiri'
require 'csv'
base_url = 'https://books.toscrape.com/catalogue/page-'
headers = {
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
total_pages = 50
data = []
(1..total_pages).each do |page|
url = "#{base_url}#{page}.html"
response = HTTParty.get(url, headers: headers)
doc = Nokogiri::HTML(response.body)
# Extracting data
titles = doc.css('h3 > a').map { |element| element['title'] }
prices = doc.css('div.product_price > p.price_color').map { |element| element.text.gsub("£", "") }
images = doc.css('div.image_container > a').map { |element| element['href'] }
ratings = doc.css('p.star-rating').map { |element| element['class'].split(' ').last }
availabilities = doc.css('p.availability > i').map { |element| element['class'].split('-').last }
# Combine the extracted data into an array of hashes
page_data = titles.zip(prices, images, ratings, availabilities).map do |title, price, image, rating, availability|
{ title: title, price: price, image: image, rating: rating, availability: availability }
end
data.concat(page_data)
sleep(rand(1..3))
end
# Save the data to a CSV file
CSV.open('book_data.csv', 'w', col_sep: ";") do |csv|
csv << data.first.keys # Write the headers
data.each { |hash| csv << hash.values } # Write the data rows
end
puts 'Data saved to book_data.csv'
As we can see, even the capabilities of the Nokogiri library can be enough for scraping websites.
Web scraping with Mechanize
Mechanize is a comprehensive library for web scraping. With it, you don’t need any additional libraries – all the necessary tools are included. You can use Mechanize to both send queries and process the results.
To use the Mechanize library, let’s connect it in our file:
require 'mechanize'
We can also use the library’s built-in features to execute the request:
agent = Mechanize.new
page = agent.get('https://books.toscrape.com/catalogue/page-1.html')
Using the same CSS selectors as before, you can use Mechanize to access and extract the data:
titles = page.css('h3 > a').map { |element| element['title'] }
prices = page.css('div.product_price > p.price_color').map { |element| element.text.gsub("£", "") }
images = page.css('div.image_container > a').map { |element| element['href'] }
ratings = page.css('p.star-rating').map { |element| element['class'].split(' ').last }
availabilities = page.css('p.availability > i').map { |element| element['class'].split('-').last }
Otherwise, using this library is the same as using Nokogiri.
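What sets Mechanize apart is that the page object also knows about its links and forms, so you can navigate without building URLs by hand. For example, following the pagination link might look like this (a small sketch; the link text “next” matches this particular site):
next_link = page.link_with(text: 'next') # Mechanize finds the "next" pagination link for us
next_page = next_link.click if next_link # follow it the way a browser would
puts next_page.uri if next_page          # the URL of the page we landed on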
Getting Categories with Mechanize
Let’s say we want to collect data from the various categories. We’ll loop through each category, open its page, and store every product we find there in a table along with its category label.
First, go to the home page and get the names and links of all the categories:
base_url = 'https://books.toscrape.com'
category_links = []
category_names = []
product_data = []
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari' # Set the User-Agent header
page = agent.get(base_url)
page.css('div.side_categories > ul > li > ul > li > a').each do |element|
category_links << element['href']
category_names << element.text.strip
end
Here we used strip to remove the surrounding whitespace and line breaks from the link text. To go through all the categories, we use a loop:
category_links.each_with_index do |category_link, index|
…
end
The CSS selectors will remain the same, so we won’t duplicate them, but the hash will change a little so that we can keep the category name:
product_data << { category: category_names[index], title: title, price: price, image: image, rating: rating, availability: availability }
By running our script, we get a table of data:
Full code:
require 'mechanize'
require 'csv'
base_url = 'https://books.toscrape.com'
category_links = []
category_names = []
product_data = []
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari' # Set the User-Agent header
# Get category links and names
page = agent.get(base_url)
page.css('div.side_categories > ul > li > ul > li > a').each do |element|
category_links << element['href']
category_names << element.text.strip
end
# Iterate over each category
category_links.each_with_index do |category_link, index|
category_url = "#{base_url}/#{category_link}"
page = agent.get(category_url)
# Scrape product data
page.css('article.product_pod').each do |product|
title = product.css('h3 > a').first['title']
price = product.css('div.product_price > p.price_color').text.gsub("£", "")
image = product.css('div.image_container > a > img').first['src']
rating = product.css('p.star-rating').first['class'].split(' ').last
availability = product.css('p.availability > i').first['class'].split('-').last
product_data << { category: category_names[index], title: title, price: price, image: image, rating: rating, availability: availability }
end
sleep(rand(1..3)) # Add a random delay between requests
end
# Save the data to a CSV file
CSV.open('book_data.csv', 'w', col_sep: ";") do |csv|
csv << product_data.first.keys # Write the headers
product_data.each { |hash| csv << hash.values } # Write the data rows
end
puts 'Data saved to book_data.csv'
In conclusion, it’s not too hard to scrape product details such as category names and store them in a CSV file when you use the Mechanize library.
Scrape dynamic data with Watir using a headless browser
The Watir framework stands apart from the other scraping libraries because it drives a real browser, which can also run headless. This means our code behaves like a real user and has a better chance of not being blocked. It lets us show or hide the browser window while still using all of its features, like clicking links and filling out forms.
To get started, we’ll need to install a webdriver: the piece of software that lets our code control the browser.
gem install webdrivers
We’ll be using the Chrome web driver for this tutorial, but you can also use other web drivers supported by Watir like Firefox or Safari. So, let’s connect the necessary libraries and create a browser object:
require 'watir'
require 'webdrivers'
browser = Watir::Browser.new(:chrome)
Watir is built on top of Selenium, so you can also pass browser options, for example to run Chrome in headless mode:
browser = Watir::Browser.new(:chrome, options: { args: ['--headless'] })
The goto() command is used to navigate to a different page. It can be used for quickly jumping between pages as you scrape data from multiple sources:
browser.goto('https://books.toscrape.com/index.html')
Scraping with Watir has a big advantage: you don’t have to add extra parameters to your requests. Because the script controls an actual browser window and handles all the transitions between pages, the requests already carry everything they need.
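As a small, purely illustrative sketch of that user-like behavior (the text-field line is hypothetical, since this site has no search box):
browser.link(text: 'Travel').click          # click the "Travel" category link like a real user
puts browser.title                          # read the title of the page we navigated to
browser.back                                # go back, just as a user would
# browser.text_field(name: 'q').set('ruby') # hypothetical: fill in a search box if the site had one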
Go to Next Page with Watir
Let’s improve our previous code. Suppose we still want to collect all the items from each category, but this time we’ll keep moving through the pages of the category until there is no longer a “Next” button.
Look at the Next button:
The link to the next page sits inside an <li> element with the class “next”. This means we can build a loop with the following exit condition (Watir resolves the link’s href to an absolute URL, so we can pass it straight to goto):
loop do
…
next_link = browser.li(class: 'next').a
break unless next_link.exists?
browser.goto(next_link.href)
end
By nesting a loop inside of an existing category loop, we can iterate through each page in all categories quickly and efficiently. This will allow us to bypass the exhausting task of manually navigating each page of every category.
Data access with Watir differs from the libraries we’ve covered before, yet it makes interacting with elements in the DOM (Document Object Model) much easier. For example, here is how to get the same data as before:
product_data = []
browser.articles(class: 'product_pod').each do |product|
title = product.h3.a.title
price = product.div(class: 'product_price').p(class: 'price_color').text.gsub('£', '')
image = product.div(class: 'image_container').a.href
rating = product.p(class: 'star-rating').class_name.split(' ').last
availability = product.p(class: 'availability').i.class_name.split('-').last
product_data << { category: category_names[index], title: title, price: price, image: image, rating: rating, availability: availability }
end
At the end of the script, be sure to specify a command to close the browser:
browser.quit
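One optional but handy pattern is to wrap the scraping logic in begin/ensure, so the browser is closed even if an error is raised midway:
browser = Watir::Browser.new(:chrome)
begin
  browser.goto('https://books.toscrape.com/index.html')
  # ... scraping logic goes here ...
ensure
  browser.quit # runs even if the code above raises an exception
end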
Full code:
require 'watir'
require 'webdrivers'
require 'csv'
base_url = 'https://books.toscrape.com'
category_links = []
category_names = []
product_data = []
# Launch a browser (in this case, Chrome)
browser = Watir::Browser.new(:chrome)
# Navigate to the main page
browser.goto(base_url)
# Get category links and names
browser.div(class: 'side_categories').ul(class: 'nav').li.ul.lis.each do |li|
link = li.a.href
name = li.a.text.strip
category_links << link
category_names << name
end
# Iterate over each category
category_links.each_with_index do |category_link, index|
browser.goto(category_link) # category links collected with Watir are already absolute URLs
loop do
# Scrape product data from the current page
browser.articles(class: 'product_pod').each do |product|
title = product.h3.a.title
price = product.div(class: 'product_price').p(class: 'price_color').text.gsub('£', '')
image = product.div(class: 'image_container').a.href
rating = product.p(class: 'star-rating').class_name.split(' ').last
availability = product.p(class: 'availability').i.class_name.split('-').last
product_data << { category: category_names[index], title: title, price: price, image: image, rating: rating, availability: availability }
end
# Check if there's a "Next" link
next_link = browser.li(class: 'next').a
break unless next_link.exists?
# Navigate to the next page
browser.goto(next_link.href)
sleep(rand(1..3)) # Add a random delay between requests
end
end
# Close the browser
browser.quit
# Save the data to a CSV file
CSV.open('book_data.csv', 'w', col_sep: ';') do |csv|
csv << product_data.first.keys # Write the headers
product_data.each { |hash| csv << hash.values } # Write the data rows
end
puts 'Data saved to book_data.csv'
After using Watir in practice, we can say that it is a powerful Ruby library that provides a simple way to automate browser interactions.
While there are other alternatives for web scraping in Ruby, such as Nokogiri and Mechanize, Watir provides a more comprehensive solution for automating browser interactions and offers a higher level of control over web scraping tasks.
Conclusion and Takeaways
Ruby is a great option for developers who need to collect large amounts of data quickly and accurately. In this article, we discussed different ways of scraping data with Ruby. We started with the simplest options, such as a request library combined with a web scraping API, before moving on to more complex solutions like Watir with a webdriver.
When deciding which library or technique to use, consider factors such as the complexity of the target website, the need for JavaScript rendering, the requirement for browser automation, and your familiarity with the library.
If you’re just starting out with Ruby, a good option is to use an API for web scraping. This way, you don’t need to worry about complicated things like JavaScript rendering, CAPTCHA solving, using proxies or avoiding blocks. If that is not your case and you need something simple, Nokogiri or Mechanize are ideal for scraping static resources. But if you need more advanced capabilities, such as emulating real user behavior on a page, then Watir with a webdriver (or a similar browser automation tool such as Selenium) is the best choice.