Web Scraping with AI

Valentina Skakun Valentina Skakun
Last update: 30 Apr 2024

In the digital age, extracting valuable insights from the vast amounts of data available on the web is essential for businesses, e-commerce researchers, and decision-makers. Web scraping, the process of extracting data from websites, has been a common practice for some time.

Recently, the use of AI in various industries has been gaining traction, and it has already improved and optimized certain activities. This article discusses how AI can be leveraged in web scraping, data processing, and analysis.

Traditional Web Scraping Techniques

We have already devoted many articles to scraping, from descriptions of what it is to scraping guides in various programming languages. Our blog contains articles about scraping in Python, NodeJS, C#, and other programming languages and even how to scrape websites using Google Sheets

If you want to learn more about web scraping learn how to write scraping scripts for any website, be it Zillow, Amazon, Google Maps, or anything else, we suggest you check out the detailed tutorials on our blog.

How Can Artificial Intelligence Be Used in Web Scraping

 At first, programs and tools based on artificial intelligence were only used by large companies. This was because of the significant amount of web data that had to be processed - even though it came with a high price. But with advances in AI and machine learning technology, these AI tools have become more accessible for everyone to use.

To better understand the advantages of using AI, let’s look at which areas they can benefit and how they can be used there.

Use AI for Scraping

But these are only some of the areas and uses of AI. So let’s talk more about where and how AI can be beneficial.

AI Applications in Web Scraping for Various Areas

It’s hard to think of any area where AI can’t be used. Wherever there is data, the use of such models can be beneficial.

Although, there are some areas in which they are most commonly used:

  1. E-commerce and Retail.

  2. Healthcare and Medicine.

  3. Finance and Banking.

  4. Manufacturing and Supply Chain.

  5. Marketing and Advertising.

  6. Transportation and Logistics.

  7. Education and E-Learning.

  8. Data Mining.

  9. Natural Language Processing (NLP).

  10. Cybersecurity.

One of the advantages of AI is that it can be trained on any data using machine learning. That is, there are no areas where data would be inappropriate. Anyone can make a narrowly focused model when discussing scientific development or their business.

Ways to Integrate AI in Different Workflows

If we look closer at how Artificial Intelligence (AI) can be used, it becomes clear that AI is great for data processing and analysis. For example, in social media like LinkedIn, AI can help generate leads, plan advertising campaigns, and analyze your competitors. In finance, AI can detect recent changes and compile summary analytics of the market.

But when it comes to e-commerce businesses specifically, these benefits are even more significant. Using AI, you can monitor what your competitors are doing, learn from their successes, apply their best practices to your business operations, and more.

With development or maintenance tasks like finding reliable proxy servers for each website, collecting relevant links, or aiding in web scraping activities, there’s no question that AI has something valuable to offer here too.

AI-Powered Data Scraping in Action

Since high-quality AI models appeared relatively recently, most users want to learn how to use them or know about their features. So, let’s look at how exactly you can use AI models for scraping and data processing, what rules to follow, and what limitations you will face.

However, before we move on to examples of using AI for scraping and processing data, let’s look at how to build prompts so that the AI model understands you. To start, you need to remember that the quality of the result depends entirely on you and how you formulated the task.

It is also worth specifying that you need language AI models like ChatGPT (OpenAI) for scraping and data collection. Based on our own experience, observe the following rules to get a good result:

  1. Let the model know that you will be specifying additional parameters. The AI model must be set up before you use it. You can say, “Now I will specify conditions that must be met for further data processing.

  2. Next, set the parameters. The AI model doesn’t know how you want it to act. If you want it to act in a way that is not “default,” specify this in the prompt. For example, if you don’t want to use complex vocabulary in the response, you could specify something like: “Avoid using complex vocabulary” or “Use natural language.”

  3. Specify the style or person from whom you would like to receive a response. For example, “act as a developer” or “use the official style.”

  4. If you want to compose some small text that will use your writing style, give an example of your text to the AI and ask them to remember the style and use it in the future: “Here’s an example of my text. I want you to use the same style next”.

Now that you know how to build prompts for an AI model, let’s look at options for what and how they can be used in scraping and data extraction.

Data Scraping and Processing

And the first thing you can use the language AI model for is processing and parsing text. You can solve almost any word-processing task with an AI language model. For example, you can ask the AI to put the keywords you want in your text or to remove stop words. The biggest plus, however, is that the language model doesn’t care if you’re serving plain text or text that uses hypertext markup language or something else so that you can use it as a web scraping tool.

Let’s see how this can be used. Let’s go to the citation site and DevTools (press F12). All information on the page will be in the body tag on all sites.

Select Body

To copy the content, right-click on the tag, then “copy” and “copy outerHTML.”

Copy Body

Then go to the language model (we use ChatGPT), and ask for your data to be processed:

Make a prompt with HTML Code

As a result, we got the following answer, which we present below:

Get structured data

As you can see, we got the information we needed quickly. However, this method has a lot of limitations. And one of them is the length of the query. As long as the page code is small, this is not a problem. But if we take the page of an actual store and use this method, we will get an error exceeding the number of characters.

However, there is a way around that as well. Many services allow you to pre-shrink the structure of the HTML document and remove unnecessary tags. For example, HTML cleaner is suitable for this purpose.

Let’s see how it would look for Zillow. We have already written how to scratch Zillow in Python, and now we will show how to do it with the language model. Go to the site and set the necessary parameters, then open DevTools and copy the HTML code of just the part that contains the property list. You can use the pointer (Ctrl+Shift+C) to search.

Copy main data

Then go to the HTML-cleaner page to clean the code and paste it into the code field.

Clean up the data

Clean the HTML, copy the result, go to ChatGPT, and perform the same query for quotes. As a result, we will get the following result of such data extraction:

Get structured result

Thus, we got the data from the page about all the properties in a few minutes without any effort. This method has several disadvantages, but we will talk about them later.

Data Visualization

Have you ever had one where you have some text but need to quickly put it on a table? If so, you know that AI can help. First, let’s take the previously ordered data we got from Zillow and ask for it to be returned to us in table form:

Make a table

But it was easy enough. We were giving already structured data, which was simple to put in a table. Now let’s give a textual description of the pros and cons of using AI for scraping and ask for that data back in table form:

AI good at analyzing Data

As you can see, the AI can help with this as well.

Data Analysis

As we saw earlier, AI is very good at various text processing, analysis, and other tasks.

You can give it a set of data, which will help you analyze and extract information from it. Whether you need to do statistical analysis, create visualizations, or identify patterns in data, it can help you with various data analysis tasks.

You can gain valuable insight into a specific product or service segment using AI. You can determine which products are the most popular, identify the top-performing sellers and detect which products and sellers have lower sales. By understanding this data in advance, you can make informed decisions about improving your business.

Also, AI can help you quickly analyze customer feedback, making it easier to identify what they like and dislike. When you use AI for web scraping, you can gain insights into your customers’ experiences that may have been overlooked. You’ll be able to find patterns in their responses and learn how best to improve your services or products for maximum satisfaction.

Text Classification

The language model can classify text into predefined categories or labels. This can be useful for sentiment analysis, topic classification, spam detection, or intent recognition.

Named Entity Recognition (NER)

AI can be used to identify and extract essential pieces of information from texts. This technology is helpful for businesses that want to analyze their competitors’ pages and researchers looking for critical data from different sources. Named entities such as people’s names, organizational names, locations, or dates can all be identified and extracted with the help of AI web scraping techniques.

Sentiment Analysis

AI can be used to determine the sentiment behind the text, such as customer reviews or social media posts. This is much more difficult to accomplish manually than simply analyzing ratings on a marketplace product page. Language models benefit this task, providing insights into whether an opinion expressed in words is positive, negative, or neutral.

Language Translation

We’ve all experienced the frustration of machine translation - where words are translated word-for-word, context is ignored, and sometimes meaning can be skewed. Although automatic translators have improved recently, AI could offer a more reliable solution.

AI models learn from previously collected data in a specific language, remembering patterns and standard phrases to interpret the meaning beyond individual words. Therefore, translations don’t consider only literal definitions but also consider how they fit within their broader context for better accuracy.

Text Generation

AI is mighty when it comes to generating text based on data. For example, suppose you have a service or product to offer. In that case, web scraping can help you collect leads and information from their websites and then provide AI with that data to automate the generation of tailored letters for each potential lead. You can create highly personalized offers that will be of interest and benefit based on what they need - all without ever having to write it by hand.

Scraping Code Generation

Using ChatGPT for scraping can be limited, but using AI to create scrapers or crawlers can help you overcome those limitations. You may still need help with other problems, such as solving captchas, using proxies, and avoiding blocking or JavaScript rendering. Fortunately, different use cases, such as web scraping API, can make your tasks easier and provide reliable results.

For example, we can use this prompt:

Write a scraper on Python to get data and save it to a table from the website quotes.toscrape.com with this structure:

<body></body>

The result will be the following scraper, which, as our tests show, we can immediately save and use:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a GET request to the website
url = 'http://quotes.toscrape.com'
response = requests.get(url)

# Create a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the quote elements
quote_elements = soup.find_all('div', class_='quote')

# Initialize lists to store the extracted data
quotes = []
authors = []
tags = []

# Extract data from each quote element
for quote in quote_elements:
    # Extract the quote text
    quote_text = quote.find('span', class_='text').text
    quotes.append(quote_text)

    # Extract the author name
    author_name = quote.find('small', class_='author').text
    authors.append(author_name)

    # Extract the tags
    tag_elements = quote.find_all('a', class_='tag')
    tag_list = [tag.text for tag in tag_elements]
    tags.append(tag_list)

# Create a DataFrame from the extracted data
data = {'Quote': quotes, 'Author': authors, 'Tags': tags}
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('quotes.csv', index=False)

Let’s do the same prompt, only we use NodeJS instead of Python:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');
const ObjectsToCsv = require('objects-to-csv');

// Send a GET request to the website
const url = 'http://quotes.toscrape.com';
axios.get(url)
  .then(response => {
    // Load the HTML response into Cheerio
    const $ = cheerio.load(response.data);

    // Find all the quote elements
    const quoteElements = $('.quote');

    // Initialize arrays to store the extracted data
    const quotes = [];
    const authors = [];
    const tags = [];

    // Extract data from each quote element
    quoteElements.each((index, element) => {
      const quoteElement = $(element);

      // Extract the quote text
      const quoteText = quoteElement.find('.text').text().trim();
      quotes.push(quoteText);

      // Extract the author name
      const authorName = quoteElement.find('.author').text().trim();
      authors.push(authorName);

      // Extract the tags
      const tagElements = quoteElement.find('.tag');
      const tagList = tagElements.toArray().map(tag => $(tag).text().trim());
      tags.push(tagList);
    });

    // Create an array of objects with the extracted data
    const data = quotes.map((quote, index) => {
      return {
        Quote: quote,
        Author: authors[index],
        Tags: tags[index]
      };
    });

    // Convert the data array to CSV format
    const csv = new ObjectsToCsv(data);

    // Save the CSV file
    csv.toDisk('quotes.csv', { append: false })
      .then(() => {
        console.log('Data successfully saved to quotes.csv');
      })
      .catch(err => {
        console.error('Error saving data to quotes.csv:', err);
      });
  })
  .catch(error => {
    console.error('Error retrieving data:', error);
  });

And for the last example, we will use the R programming language:

library(rvest)
library(writexl)

# Specify the URL of the website
url <- "http://quotes.toscrape.com"

# Send a GET request to the website and read the HTML content
page <- read_html(url)

# Extract the quote elements using CSS selectors
quote_elements <- page %>% html_nodes(".quote")

# Initialize empty vectors to store the extracted data
quotes <- character()
authors <- character()
tags <- list()

# Extract data from each quote element
for (quote_element in quote_elements) {
  # Extract the quote text
  quote <- quote_element %>% html_node(".text") %>% html_text() %>% trimws()
  quotes <- c(quotes, quote)

  # Extract the author name
  author <- quote_element %>% html_node(".author") %>% html_text() %>% trimws()
  authors <- c(authors, author)

  # Extract the tags
  tag_nodes <- quote_element %>% html_nodes(".tag")
  tag_list <- tag_nodes %>% html_text() %>% trimws()
  tags <- c(tags, list(tag_list))
}

# Create a data frame with the extracted data
data <- data.frame(Quote = quotes, Author = authors, Tags = tags, stringsAsFactors = FALSE)

# Specify the output file path
output_file <- "quotes.xlsx"

# Save the data frame to an Excel file
write_xlsx(data, path = output_file)

# Print a success message
cat("Data successfully saved to", output_file, "\n")

As we can see, we can use any programming language, and all we have to do is to set up and prepare the environment correctly.

Text Summarization

AI is also a great tool to help you sift through long texts and summarize them quickly. For instance, if you’ve gathered a list of potential leads but aren’t sure which ones might be interested in your offer or are the best fit for your business, AI can help by condensing large amounts of content into smaller summaries that capture the core points and essential details.

Question Answering

You can not only use ready-made language models. With the help of machine learning, you can create your own tailored to your datasets. For instance, if you’re running a business and want to lighten the workload of technical support staff, artificial intelligence might be just what you need.

To train an effective model for this purpose, simply feed it with factual user inquiries and responses from people in customer service roles. Once enough data is collected and processed by AI algorithms. You’ll have a powerful tool that can answer most questions about your product quickly and accurately.

Data-Driven Decision Making

When faced with complex decisions, you can discuss your problem with AI, and it can help you analyze available data, provide insights, and suggest possible courses of action based on the information provided.

However, this does not mean you must rely entirely on the AI to make decisions. It can only help and suggest options, but you must decide.

Challenges of Using AI

Web scraping can present some challenges, mainly if you use artificial intelligence (AI) to do the work. AI solutions are powerful tools that require careful consideration and implementation to succeed. Depending on the complexity of your web scraping task, AI may only sometimes offer an optimal solution and could have some challenges.

Length Limits

The first problem you must face when using AI to process or scrape data is the query length limitation. For example, the GPT-3.5 model, which is the most popular, has a maximum token limit of 4096 tokens per input. Tokens vary in length, but on average, they are equivalent to words or characters. It’s important to note that if the input exceeds this token limit, it must be truncated or shortened to fit within the constraints.

Price

Let’s take a look at using ChatGPT API and HasData Web Scraping API to see which is more profitable for scraping.

First, take a view at this research. It looked at over a billion web pages and concluded that the average web page weighs 2 MB. That said, 1 character is 1 byte. So, if a web page weighs 2 MB, it contains 1,048,576 characters or 157,286 tokens.

Special services will reduce the data to the maximum allowable for page processing in ChatGPT. In our case, it is no more than 4096 tokens. The price for 1000 tokens - is $0.002. That is, for scraping 1 page, we use $0.008.

Now let’s look at the web scraping API pricing plan. The lowest cost for an individual plan is $30 per month. It includes 50,000 credits or 5,000 requests. Let’s compare using web scraping API and AI for web scraping:

NameStart pricePrice for 5,000 requests
HasData Web Scraping API$30 per 1 month$30
ChatGPT API$5 per 3 months$40

However, as a result of our tests, we determined that using ChatGPT, you will not be able to get data from all pages because not all pages can be reduced to the required size.

Work with External Resources

Unfortunately, AI web scraping isn’t as straightforward as simply entering a link. You must consider the page structure you want to scrape and provide this information. Moreover, if you’re looking for data created after the date your AI model was trained, it won’t be able to help you out here.

AI can be Wrong

AI can generate answers based on the training data, but unlike humans, AI cannot think critically. This means that even if its answer is wrong, AI may still be very confident in its response.

Conclusion and Takeaways

Our findings show that using AI can be great for handling data. Such models do an excellent job of analyzing and assisting in processing. However, AI can only be used as an assistant tool that also needs human control.

AI models can navigate complex website structures, interpret unstructured data, and adapt to website changes over time. With its ability to learn and generalize vast amounts of data, AI opens the door to more accurate, efficient, and scalable web scraping techniques.

But it is also unsuitable for scraping large pages, can make wrong decisions if it is based on disputable data, and has several limitations.

Blog

Might Be Interesting