Google Sheets Web Scraping: A Simple Guide

Valentina Skakun
Last update: 31 Oct 2024

Web scraping enables quick data collection from websites, and Google Sheets offers a straightforward way to do this without programming knowledge. By using built-in functions and formulas, you can pull data from web pages directly into your spreadsheet. Below, we’ll cover the essential formulas and methods to get started with web scraping in Google Sheets.

Main formulas for Google Sheets web scraping

Google Sheets supports a number of formulas that allow you to work with web data in various ways. You can find a complete list of them on the official documentation page.

The most common and useful formulas for scraping data from websites are these three:

  • IMPORTXML. This formula allows you to extract data from specific XML elements on a web page. You can use XPath queries to target specific data points.

  • IMPORTFEED. This formula is used to import data from an RSS or Atom feed. The data is imported directly into your spreadsheet.

  • IMPORTHTML. This formula is designed to extract data from HTML tables on web pages. It’s a good option for simple tables with well-defined structures.

For those with programming skills, there is another way to scrape data in Google Sheets - custom scripts written in Google Apps Script, which is based on JavaScript. This method provides more flexibility than relying on the existing formulas, as it allows users to write their own code for particular needs.

Before diving into practical examples of these formulas, along with their advantages and disadvantages, it’s important to note that the first time you use one of the formulas discussed in this section, you will need to allow them to access external data.

IMPORTHTML example

Once access is granted, you can freely run these formulas anywhere in the document.

IMPORTHTML - Extract Data from HTML Tables and Lists

The IMPORTHTML function is one of the simplest and most common ways to get data from a page. It allows you to extract data from tables and lists on a given web page.

To use it, you need to enter a formula in the cell of the following form:

=IMPORTHTML(url, query, index)

Instead of url, specify (in quotes) the link to the web page you want to pull data from. The query parameter takes either “list” or “table”, depending on the type of content you want to collect. The last parameter, index, is the number of the table or list on the page, counting from 1. So if a page contains several lists or tables, you can target exactly the one you need.
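
For instance, to pull a bulleted list instead of a table, you would pass “list” as the query type. The URL and index below are purely illustrative; adjust the index until you hit the list you need:

=IMPORTHTML("https://en.wikipedia.org/wiki/Web_scraping", "list", 1)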

Let’s consider an example and try to collect data about proxies from our page with a list of free proxies:

=IMPORTHTML("https://hasdata.com/free-proxy-list", "table", 1)

The result of executing this formula will look like this:

IMPORTHTML result

Now that we’ve explored how to use IMPORTHTML, let’s delve into the pros and cons of this function:

Pros:

  • Simple to set up and use with basic syntax.
  • Data is automatically refreshed whenever the spreadsheet is opened.
  • Can be used to extract tables or lists by specifying the type and index.

Cons:

  • Cannot extract data from dynamically loaded pages (e.g., JavaScript-rendered content).
  • Extracted data is in plain text and may require additional formatting.
  • Dependent on the structure of the web page; changes to the page’s HTML can break the function.
  • Only works with HTML tables and lists, not other types of web content.

Overall, IMPORTHTML is a useful tool for extracting data from simple, structured web pages. However, its limitations in handling dynamic content, authentication, and potential data integrity issues should be considered when choosing a data extraction method.

IMPORTXML - Extract Data from Any Website

Unlike the previous formula, IMPORTXML allows you to retrieve data from any website using XPath. We’ve already discussed XPath in a previous article, so we won’t delve into it here.

To extract any data from a page, you can use the following template:

=IMPORTXML(url, xpath)

In addition to the page URL and the XPath of the desired elements, you can specify a third parameter: the language and region code for which you want to retrieve data from the website. Since this parameter is optional, it can be omitted; in that case, the document’s own locale settings are used.
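
If you do want to pin the locale, the call might look like this (the third argument is a language/region code; "en_US" here is just an example):

=IMPORTXML("https://hasdata.com/blog/", "//div[@class='post-content']", "en_US")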

Let’s use IMPORTXML to retrieve all the articles from our blog:

=IMPORTXML("https://hasdata.com/blog/", "//div[@class='post-content']")

This will return the following data:

IMPORTXML result

As we said, IMPORTXML() is a powerful tool for quickly and easily collecting large amounts of information in Google Sheets. It helps to scrape data from websites directly into a spreadsheet. However, it’s important to weigh up the pros and cons before deciding if this is the right option for you.

Pros:

  • Easy to use with no coding knowledge required.
  • Can quickly import data into a spreadsheet.
  • Works with the majority of websites.
  • Free to use.

Cons:

  • Limited number of queries (1,000 queries per hour).
  • Limited ability to navigate dynamic websites.
  • Can be slow when scraping large amounts of data.
  • Can scrape only one URL per formula.

The IMPORTXML formula is perfect for web scraping projects where you need specific pieces of information to analyze trends or handle other tasks, making it a good choice for those who aren’t comfortable with programming.

Author’s Tip: If you’re like me and often encounter sites with dynamically loaded content (pages that update or load data after they first appear), you’ll quickly find that IMPORTXML won’t capture this type of information. In these cases, I’ve found it helpful to switch to Google Apps Script or use a web scraping API to ensure I’m getting the most accurate, up-to-date data.

IMPORTFEED - Import RSS or Atom Feeds

The IMPORTFEED formula is a valuable tool for extracting data from RSS or Atom feeds and presenting it in a spreadsheet format. While its scope is narrower compared to other data collection formulas, it can still be a useful addition to your data gathering toolkit.

The general syntax of the formula is as follows:

=IMPORTFEED(url, [query], [headers], [num_items])

The only required parameter is the link to the feed. The query parameter accepts “items”, which returns a full table with all feed items, or “feed”, which returns a single row of feed information. The headers parameter (TRUE/FALSE) specifies whether a header row should be added, and num_items sets how many items to retrieve; if you don’t specify it, all available items are returned.
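
For example, a fully parameterized call might look like this (here we keep the header row and limit the output to the five most recent items):

=IMPORTFEED("https://feeds.megaphone.fm/newheights", "items", TRUE, 5)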

Let’s utilize the IMPORTFEED formula to extract data from the New Heights RSS feed:

=IMPORTFEED("https://feeds.megaphone.fm/newheights")

This formula will import the feed data into a spreadsheet table, displaying elements like titles, summaries, and publication dates. The result will look like this:

IMPORTFEED result

Now let’s consider the pros and cons of this formula:

Pros:

  • Simple syntax and easy to implement for basic feed imports.
  • Automatically refreshes data from the feed, ensuring up-to-date information.
  • Can handle changes in feed content dynamically without manual intervention.

Cons:

  • Can only import data from RSS or Atom feeds, not other types of web content.
  • May have issues with feeds that are not properly formatted or contain complex data structures.
  • If the feed source goes down or changes format, the function may break.
  • Limited customization in how the data is imported and displayed.

Overall, IMPORTFEED can be a handy tool for collecting data from RSS or Atom feeds. However, its limited scope makes it less popular than other data collection formulas.

Additional Useful Formulas

In addition to the formulas discussed earlier, there are others that are useful for processing or importing information, for example, when you don’t want to scrape data from a website but rather import it from another Google Sheets file.

It may also happen that simply extracting data from a page isn’t enough and you need only the values that match a particular pattern. In that case, you can use a dedicated formula for working with regular expressions.

In this section, we’ll show which other formulas can be used and how to combine them with those covered earlier to get the best results in the scraping process.

REGEXEXTRACT - Extract Specific Text Patterns from Web Pages

The REGEXEXTRACT function in Google Sheets allows you to extract text that matches a specified regular expression. While it’s not particularly useful on its own for web scraping and data parsing, it becomes incredibly effective when combined with other functions like IMPORTXML.

Utilizing these two functions together is particularly beneficial when you need to first retrieve data from a webpage using IMPORTXML and then refine the extracted information using REGEXEXTRACT. For instance, you can extract email addresses from a webpage.

While we’ve previously discussed scraping email addresses, we’ll now explore how to do so using Google Sheets. Let’s break down the REGEXEXTRACT function syntax:

=REGEXEXTRACT(text, pattern)
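
As a quick standalone illustration (the text here is made up, just to show the syntax), the following returns 42.50:

=REGEXEXTRACT("Total: $42.50", "[0-9]+\.[0-9]+")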

Consider a random New York cafe website. We’ll extract the contact email address from the webpage:

=REGEXEXTRACT(IMPORTXML("https://blackcatles.com/", "//body"), "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

In this formula, the text is the result returned by the IMPORTXML function, and we use a regular expression to extract email addresses. The extracted email address is shown in the sheet.

REGEXEXTRACT result
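
Note that if IMPORTXML returns its result across several cells rather than as a single string, one common workaround (a variant sketch, not the only approach) is to join the cells with TEXTJOIN before applying the pattern:

=REGEXEXTRACT(TEXTJOIN(" ", TRUE, IMPORTXML("https://blackcatles.com/", "//body")), "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")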

As with previous examples, let’s summarize the pros and cons of this formula in a table:

Pros:

  • Powerful pattern matching capabilities.
  • Can extract specific parts of text easily.
  • Flexible and versatile for various data types.
  • Useful for cleaning and parsing data.

Cons:

  • Requires knowledge of regular expression syntax.
  • Can be complex for beginners.
  • Performance can be slow for large datasets.
  • Limited by Google Sheets’ formula size constraints.

The REGEXEXTRACT function, when combined with IMPORTXML, provides a toolset for scraping and refining data from webpages in Google Sheets. By understanding regular expressions and utilizing these functions effectively, you can extract valuable information from various websites.

IMPORTDATA - Import Data from CSV or TSV Files

The IMPORTDATA function in Google Sheets lets you import data from a CSV or TSV file hosted on the web. While it’s not the most common task, IMPORTDATA is particularly useful when the data you need is already available as a file.

The IMPORTDATA function follows a simple syntax:

=IMPORTDATA(url)

Let’s utilize data gathered from a previous article on scraping Google Trends using a no-code scraper as an example. We’ll use the URL of the resulting CSV file and plug it into our formula:

=IMPORTDATA("https://storage.googleapis.com/scrapeit-cloud-marketplace/1de2d44c-3242-44d6-8354-da5e1d5b8451.csv")

This will promptly import all the necessary data into our spreadsheet.

IMPORTDATA result

Now, let’s take a look at its pros and cons:

Pros:

  • Easy to use with simple URLs.
  • Automatically updates when the source file changes.
  • Suitable for importing structured data.
  • Ideal for static and regularly updated datasets.

Cons:

  • Only works with CSV and TSV formats.
  • Limited error handling capabilities.
  • Cannot handle complex data structures or formats.
  • Dependent on the availability of the external URL.

The IMPORTDATA function offers a straightforward approach to importing CSV or TSV files from the web into Google Sheets. Its simplicity and effectiveness make it a valuable tool for tasks involving data from online sources.

IMPORTRANGE - Import Data from Another Google Sheet

The final formula on our list is IMPORTRANGE, which allows you to import data from other Google Sheets. This can be useful when you need to keep data from multiple sheets up-to-date in one file and refresh it periodically.

Using the IMPORTRANGE function not only allows you to easily consolidate all the data you need into one file, but also keeps it up-to-date. Let’s look at the syntax:

=IMPORTRANGE(spreadsheet_url, range_string)

This means that you need to specify the link to the document and the range of cells from which you want to import. As an example, let’s consider the previously created document for tracking positions and scraping Google SERP using Google Sheets and import the necessary data from it:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/1HLDrQFzN3cH-vHnBYHhqgiw3p4iUFCchqW0cuIoR2Y8/edit?gid=88338269#gid=88338269", "Sheet1!A1:E10")

As a result, we imported the scraped data from Google SERP using Apps Script into another table:

Now let’s review its pros and cons:

Pros:

  • Enables data sharing across different sheets.
  • Easy to use with a simple formula.
  • Automatically updates when the source sheet changes.
  • Facilitates collaboration and data centralization.

Cons:

  • Requires explicit permission to access external sheets.
  • Can cause performance issues with large datasets.
  • Limited to data within Google Sheets.
  • Requires manual updates for structural changes in the source sheet.

Overall, the IMPORTRANGE function is a valuable tool for importing and consolidating data from multiple Google Sheets spreadsheets. It can save you time and effort while ensuring that you always have the most up-to-date information. However, it’s important to be aware of the limitations and potential for errors when using this function.

Web scraping with Google Apps Script

Google Apps Script uses JavaScript as its programming language. With this tool, you can create custom menus and buttons, access third-party services, manipulate documents stored in Drive, send emails right from your sheets, and even fetch external resources.

To open the Apps Script editor, go to Extensions > Apps Script.

Apps Script

As an example, we will use HasData’s web scraping API. After signing up, you’ll get 1000 free credits to help get you started.

To start, let’s create a simple script that retrieves the full code of a page. After launching Apps Script, we land in the script editor.

Function

We can rename the function to something more convenient or keep the default name. This matters because the function name is what we’ll use to call the script from the sheet: it becomes the name of the custom formula.

If you don’t want to learn how to create your own script and want to use the finished result, skip to the final code and an explanation of how it works.

To execute a query, we need to set its parameters. You can find the API key in your account in the dashboard section. First, let’s set the headers:

var headers = {'Content-Type': 'application/json', 'x-api-key': 'YOUR-API-KEY'};

Next, let’s specify the body of the request. Since we need the code of the whole page, we just specify the link to the site from which we need to collect data:

var data ={'url': 'https://nalgene.com/water-bottles/wide-mouth/'};

Finally, let’s assemble the request, execute it, and log the result:

var options = {
    'method': 'post',
    'headers': headers,
    'payload': JSON.stringify(data) // send the request body as a JSON string
  };
  var response = UrlFetchApp.fetch('https://api.hasdata.com/scrape', options).getContentText();
  Logger.log(response);
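
Taken together, a minimal working version of this first script might look like the sketch below. The function name getPageHtml is just our choice; the endpoint and response handling mirror the snippets above, and YOUR-API-KEY must be replaced with your own key:

// A minimal sketch: fetch the full HTML of a page through the scraping API.
function getPageHtml() {
  var headers = {
    'Content-Type': 'application/json',
    'x-api-key': 'YOUR-API-KEY'
  };
  var data = {'url': 'https://nalgene.com/water-bottles/wide-mouth/'};
  var options = {
    'method': 'post',
    'headers': headers,
    'payload': JSON.stringify(data) // serialize the body as JSON
  };
  var response = UrlFetchApp.fetch('https://api.hasdata.com/scrape', options).getContentText();
  Logger.log(response); // raw JSON response from the API
  return response;
}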

As a result, we get the following:

Execution Result

You can configure the data extraction settings in your account in the Web Scraping API section. There you can also visually configure the query using special functions.

For example, we can use the extraction rules feature to extract only the product names and prices:

 "extract_rules": {
    "title": "h2.woocommerce-loop-product__title",
    "price": "div.price-container"
  },

New rules are specified in the format “Name”: “CSS selector”. Hardcoding these parameters in the script isn’t very convenient; to make the formula dynamic, it’s better to pass them from the sheet. So, let’s accept them as input. Input parameters are specified in parentheses next to the function name.

function myFunction(...rules) {…}

This means the function accepts any number of arguments, which are collected into the rules array. For convenience, let’s rename the function and add one more input parameter: a link to the page from which the data will be collected.

function scrape_it(url, ...rules) {…}

Now let’s declare the variables and check whether the caller passed a boolean headers flag as the last argument:

  var extract_rules = {};
  var headers = true;

// Check if the last argument is a boolean
  if (typeof rules[rules.length - 1] === "boolean") {
    headers = rules.pop();
  }

Let’s write the rules in the extract_rules variable in the form we need:

for (var i = 0; i < rules.length; i++) {
    var rule = rules[i].split(":");
    extract_rules[rule[0]] = rule[1].trim();
  }
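
One caveat with this parsing (a possible refinement, not part of the original walkthrough): split(":") also splits on colons inside a CSS selector, such as pseudo-classes like li:first-child. A slightly more defensive variant splits only on the first colon:

for (var i = 0; i < rules.length; i++) {
    // Split only on the first colon so selectors containing ":" stay intact
    var separatorIndex = rules[i].indexOf(":");
    var name = rules[i].slice(0, separatorIndex).trim();
    var selector = rules[i].slice(separatorIndex + 1).trim();
    extract_rules[name] = selector;
  }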

Let’s change the body of the query by putting in variables instead of values and then execute it.

  var data = JSON.stringify({
    "extract_rules": extract_rules,
    "wait": 0,
    "screenshot": false,
    "block_resources": true,
    "url": url
  });

    var options = {
    "method": "post",
    "headers": {
      "x-api-key": "YOUR-API-KEY",
      "Content-Type": "application/json"
    },
    "payload": data
  };

  var response = UrlFetchApp.fetch("https://api.hasdata.com/scrape", options);

This request returns JSON whose attributes contain the required data, named exactly as specified in the extraction rules. To get at this data, let’s parse the JSON response:

  var json = JSON.parse(response.getContentText());

Let’s set the attribute variable that contains the result of the extraction rules:

  var result = json["scrapingResult"]["extractedData"];
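
For reference, the parsed response roughly follows the shape sketched below. This is only an illustration based on how the script reads it; the actual field names come from your extraction rules, and other metadata may also be present:

  // Hypothetical shape of the parsed JSON (illustrative only)
  var exampleJson = {
    scrapingResult: {
      extractedData: {
        title: ["Product A", "Product B"],
        price: ["$10.00", "$12.00"]
      }
    }
  };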

Get all the keys and the length of the largest array to know its dimensionality:

   // Get the keys from extract_rules
  var keys = Object.keys(extract_rules);

  // Get the maximum length of any array in extractedData
  var maxLength = 0;
  for (var i = 0; i < keys.length; i++) {
    var length = Array.isArray(result[keys[i]]) ? result[keys[i]].length : 1;
    if (length > maxLength) {
      maxLength = length;
    }
  }

Create an output variable whose first row will hold the column names of the future table, taken from the extraction-rule keys; this row is added only if the headers flag is true.

  // Create an empty output array with the first row being the keys (if headers is true)
  var output = headers ? [keys] : [];

Let’s go through all the elements from the extraction rules and add them to the output variable line by line.

  // Loop over each item in the extractedData arrays and push them to the output array
  for (var i = 0; i < maxLength; i++) {
    var row = [];
    for (var j = 0; j < keys.length; j++) {
      var value = "";
      if (Array.isArray(result[keys[j]]) && result[keys[j]][i]) {
        value = result[keys[j]][i];
      } else if (typeof result[keys[j]] === "string") {
        value = result[keys[j]];
      }
      row.push(value.trim());
    }
    output.push(row);
  }

Finally, let’s put the result back on the sheet:

  return output;

Now you can save the resulting script and call it directly from the sheet, specifying the necessary parameters.

Final code for web scraping with Google Apps Script

For those who skipped the section on creating the script but want to use the finished version, here’s how to get started.

Go to Google Sheets and create a new Spreadsheet.

Create New File

Go to Extensions and open Apps Script. In the editor that appears, select all the existing code and replace it with our script:

Go to Apps Script

Script code:

function scrape_it(url, ...rules) {
  var extract_rules = {};
  var headers = true;

  // Check if the last argument is a boolean
  if (typeof rules[rules.length - 1] === "boolean") {
    headers = rules.pop();
  }

  for (var i = 0; i < rules.length; i++) {
    var rule = rules[i].split(":");
    extract_rules[rule[0]] = rule[1].trim();
  }

  var data = JSON.stringify({
    "extract_rules": extract_rules,
    "wait": 0,
    "screenshot": false,
    "block_resources": true,
    "url": url
  });

  var options = {
    "method": "post",
    "headers": {
      "x-api-key": "YOUR-API-KEY",
      "Content-Type": "application/json"
    },
    "payload": data
  };

  var response = UrlFetchApp.fetch("https://api.hasdata.com/scrape", options);
  var json = JSON.parse(response.getContentText());
  var result = json["scrapingResult"]["extractedData"];

    // Get the keys from extract_rules
  var keys = Object.keys(extract_rules);

    // Get the maximum length of any array in extractedData
  var maxLength = 0;
  for (var i = 0; i < keys.length; i++) {
    var length = Array.isArray(result[keys[i]]) ? result[keys[i]].length : 1;
    if (length > maxLength) {
      maxLength = length;
    }
  }
    // Create an empty output array with the first row being the keys (if headers is true)
  var output = headers ? [keys] : [];

    // Loop over each item in the extractedData arrays and push them to the output array
  for (var i = 0; i < maxLength; i++) {
    var row = [];
    for (var j = 0; j < keys.length; j++) {
      var value = "";
      if (Array.isArray(result[keys[j]]) && result[keys[j]][i]) {
        value = result[keys[j]][i];
      } else if (typeof result[keys[j]] === "string") {
        value = result[keys[j]];
      }
      row.push(value.trim());
    }
    output.push(row);
  }

  return output;
}

Replace “YOUR-API-KEY” with your API key, which you can find in the dashboard section of your HasData account, and save the script.

Now let’s see the result. Go back to the sheet, put the link to the page in cell A1, and enter the formula in the next cell. In general, the formula looks like this:

=scrape_it(URL, element1, [element2], [false])

Where:

  1. URL. A link to the page to be scraped.

  2. Elements (you can specify more than one). Each element is given in the format “Title: CSS_Selector @attribute”. Instead of Title, provide a name for the element; after the colon, specify the element’s CSS selector; and, if you need an attribute’s value, add a space, @, and the attribute name. If you only need the element’s text, omit the attribute. For example:

     • “title: h2” – to get the product name.

     • “link: a @href” – to get the product link.

  3. [false]. An optional parameter that defaults to true. It controls whether column headers are included. If you don’t specify anything, a header row is added in the first cells, taken from the names in the Elements parameter. If you specify false, column headers are not added.

Now, using all the skills we learned, let’s get a table identical to the one we got with IMPORTXML but using the scrape_it formula:

=scrape_it(A1,"title: h2.woocommerce-loop-product__title", "price: div.price-container","link:div.add-on-product>a @href", "Image:div.add-on-product__image-container>img @src")

As a result, we got the following table:

scrape_it formula

However, suppose we want to get the description from every single product page; in that case we don’t need column headers, so we add the optional third parameter, false, at the end of the formula:
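
For example, assuming the product page URL is in cell A1, such a call might look like this (the selector is a placeholder; use whatever matches the description block on your target page):

=scrape_it(A1, "description: div.product-description", false)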

Scrape Description

Scraping with Google Sheets and Apps Script can be a good approach to data collection and analysis. With the help of Google’s scripting language, you can easily access external websites to retrieve information, use a web scraping API to automate the process, or manipulate the results in whatever way suits your purpose.

It comes in handy where IMPORTXML cannot be used, for example, on sites with dynamic rendering. Using Google Apps Script together with a web scraping API solves this problem: the script we’ve built is suitable for scraping any site, no matter what platform it is built on or whether it renders content dynamically.
