Web Scraping Google News using Python: Step-by-Step Guide

Valentina Skakun
Last update: 11 Nov 2024

Google News is the biggest news aggregator out there, available in 141 countries and offering content in 41 languages. By 2021, it had over 1.5 billion users, making it one of the go-to platforms for keeping up with the latest news or doing market research.

In this article, we’ll show you how to build your own Google News scraper that automatically collects fresh news and helps you track topics you’re interested in. We’ll guide you through using the Google SERP API and web scraping tools like Beautiful Soup and Selenium for easy, automated data collection. These methods open up a world of possibilities beyond just staying updated on the news. Discover an easier way to stay in the know!

How to Scrape Google News using API

There are two ways to extract news from Google search results: using a Python library for web scraping, or using the Google News API. The API option is a great choice for beginners and anyone who wants to avoid the hassle of dealing with blocking, captchas, and proxy rotation.

The Google News API returns data in JSON format, which is easy to process and work with. Let’s see how to scrape Google News headlines and descriptions using the Google News API, what you need, and how to save the obtained data in Excel.

Sign Up and Get an API key

To use the API, you need an API key. To get it, go to the HasData website and sign up.

Dashboard

Go to the Dashboard tab in your account and copy your personal API key. We will need it later.

Full Code: Google News Scraping with API

Before we dive into the step-by-step process of creating the script, here’s a ready-made example for those who just want the end result right away:

import requests
import pandas as pd

keyword = 'new york good news'
api_url = 'https://api.hasdata.com/scrape/google'
headers = {'x-api-key': 'YOUR-API-KEY'}
params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'nws'
}

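try:
    # Send the GET request to the API endpoint
    response = requests.get(api_url, params=params, headers=headers)
    if response.status_code == 200:
        data = response.json()
        news = data['newsResults']
        # Build a table from the list of news items and save it
        df = pd.DataFrame(news)
        df.to_excel("news_result.xlsx", index=False)
        df.to_csv("news_result.csv", index=False)
    else:
        # Report failed requests instead of ignoring them silently
        print('Request failed with status code:', response.status_code)
except Exception as e:
    print('Error:', e)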

If you’re eager to test the script immediately and aren’t interested in diving into the details of its creation, feel free to jump over to Google Colaboratory. Just enter your API key and keyword, and run the script right away. However, if you’d like to go through the step-by-step process of building the script, feel free to continue to the next section, where I’ll explain everything in detail.

Set the Parameters

First, let’s install the necessary libraries. To do this, run the following in the command prompt:

pip install requests
pip install pandas
pip install openpyxl

The Requests library lets us send HTTP requests to the API and get the necessary data. The Pandas library is needed to process the data and save it to a file; it relies on the openpyxl package to write Excel (.xlsx) files.

Now that the libraries are installed, create a file with the .py extension and import them.

import requests
import pandas as pd

Now let’s put the values we will reuse into variables. There are only two of them: the URL of the API endpoint and a keyword.

keyword = 'new york good news'
api_url = 'https://api.hasdata.com/scrape/google'

The last thing to set is the headers and the query parameters of the request. The header contains only one parameter: the API key. The query string, however, can contain many parameters, including localization parameters. The full list of parameters can be found in our documentation.

In this example, we will use only the necessary parameters:

headers = {'x-api-key': 'YOUR-API-KEY'}

params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'nws'
}

We specified the keyword (q), the Google domain (domain), and the search type (tbm), where nws selects news results. The remaining parameters can be left unspecified, but they can be used to fine-tune the query and get more specific results.

Make a Request

Now that all the necessary parameters are specified, execute the request:

response = requests.get(api_url, params=params, headers=headers)

HasData’s Google News API uses a GET request and provides a JSON response in the following format:

{
  "requestMetadata": {
    "id": "57239e2b-02a2-4bfb-878d-9c36f5c21798",
    "googleUrl": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIaQXVzdGluLFRleGFzLFVuaXRlZCBTdGF0ZXM%3D&gl=us&hl=en&filter=1&tbm=nws&oq=Coffee&sourceid=chrome&num=10&ie=UTF-8",
    "googleHtmlFile": "https://storage.googleapis.com/scrapeit-cloud-screenshots/57239e2b-02a2-4bfb-878d-9c36f5c21798.html",
    "status": "ok"
  },
  "pagination": {
    "next": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&ei=sim9ZPe7Noit5NoP3_efgAU&start=10&sa=N&ved=2ahUKEwj33Jes9qSAAxWIFlkFHd_7B1AQ8NMDegQIAhAW",
    "current": 1,
    "pages": [
      {
        "2": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&ei=sim9ZPe7Noit5NoP3_efgAU&start=10&sa=N&ved=2ahUKEwj33Jes9qSAAxWIFlkFHd_7B1AQ8tMDegQIAhAE"
      },
      // ... More pages ...
    ]
  },
  "searchInformation": {
    "totalResults": "37600000",
    "timeTaken": 0.47
  },
  "newsResults": [
    {
      "position": 1,
      "title": "De'Longhi's TrueBrew Coffee Maker Boasts Simplicity, but the Joe Is Just So-So",
      "link": "https://www.wired.com/review/delonghi-truebrew-drip-coffee-maker/",
      "source": "WIRED",
      "snippet": "The expensive coffee maker with Brad Pitt as its spokesmodel is better than a capsule-based machine but not as good as competing single-cup...",
      "date": "1 day ago"
    },
    // ... More news results ...
  ]
}

You can display the obtained data on the screen or continue working with it.
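As a rough illustration of displaying the data, the sketch below parses a trimmed-down sample in the same shape as the response above and prints one line per news item. The sample string and the summarize helper are assumptions made for this example, not part of the API; with a live request you would start from the parsed response instead.

```python
import json

# Trimmed-down sample in the same shape as the response shown above
# (illustrative values, not live API output).
sample_response = """
{
  "requestMetadata": {"status": "ok"},
  "newsResults": [
    {
      "position": 1,
      "title": "De'Longhi's TrueBrew Coffee Maker Boasts Simplicity, but the Joe Is Just So-So",
      "link": "https://www.wired.com/review/delonghi-truebrew-drip-coffee-maker/",
      "source": "WIRED",
      "date": "1 day ago"
    }
  ]
}
"""


def summarize(raw_json):
    """Return one 'position. [source] title (date)' line per news item."""
    data = json.loads(raw_json)
    lines = []
    # .get() with a default keeps the loop from failing on a missing key
    for item in data.get("newsResults", []):
        lines.append(
            f"{item.get('position', '?')}. [{item.get('source', 'unknown')}] "
            f"{item.get('title', '')} ({item.get('date', 'n/a')})"
        )
    return lines


for line in summarize(sample_response):
    print(line)
```

Reading every field through .get() is a design choice for this sketch: a result that lacks, say, a date still produces a readable line instead of raising KeyError.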

Parse the Data

To further process the data, we need to parse it. The json() method of the response parses the JSON body into a Python dictionary:

data = response.json()

Now we can use the key names to retrieve specific data:

news = data['newsResults']

Thus, we have put all the news into the news variable.

Export the Data to CSV

To save the obtained data as a CSV or Excel file, we use Pandas. With this library, we can turn the list of news items from the JSON response into a data frame, that is, a data set organized as a table.

df = pd.DataFrame(news)

The column headings will be identical to the key names of the news items. Now let’s just save the data frame to a file:

df.to_csv("news_result.csv", index=False)
df.to_excel("news_result.xlsx", index=False)

The result is a table like this:

Excel File
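To see how the keys of each news item become column headings, here is a minimal sketch that does the same conversion with Python’s built-in csv module instead of Pandas. It is a swapped-in technique for illustration only; the two rows and the news_to_csv helper are made up for the example.

```python
import csv
import io

# Two illustrative rows in the shape of the 'newsResults' items
news = [
    {"position": 1, "title": "First headline", "source": "WIRED", "date": "1 day ago"},
    {"position": 2, "title": "Second headline", "source": "CNN"},  # no 'date' key
]


def news_to_csv(rows):
    """Serialize a list of dicts to CSV text; the header row is built from the keys."""
    # Collect column names in first-seen order so no row's keys are dropped
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    # restval fills in cells for rows that lack one of the columns
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(news_to_csv(news))
```

Pandas handles the same case by leaving the missing cell empty, which is why the one-line pd.DataFrame(news) call is usually the more convenient choice.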

To make the code more reliable, let’s wrap it in a try/except block and check for a successful response:

try:
    response = requests.get(api_url, params=params, headers=headers)
    if response.status_code == 200:
        data = response.json()
        news = data['newsResults']
        df = pd.DataFrame(news)
        df.to_excel("news_result.xlsx", index=False)
        df.to_csv("news_result.csv", index=False)
    else:
        # Report failed requests instead of ignoring them silently
        print('Request failed with status code:', response.status_code)
except Exception as e:
    print('Error:', e)

Thus, we got the data without the need to process HTML pages, use proxies or search for ways to bypass blocking and captchas.
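If you want more specific error messages than a single catch-all except, one possible approach is to move the checks into a small function. The sketch below assumes you pass it response.status_code and response.text; extract_news is a hypothetical helper name, and the messages are only examples.

```python
import json


def extract_news(status_code, body_text):
    """Return the list of news items, or raise ValueError with a readable reason."""
    if status_code != 200:
        raise ValueError(f"Request failed with HTTP status {status_code}")
    try:
        data = json.loads(body_text)
    except json.JSONDecodeError as exc:
        raise ValueError("Response body is not valid JSON") from exc
    # The body must be an object that contains a 'newsResults' list
    news = data.get("newsResults") if isinstance(data, dict) else None
    if not isinstance(news, list):
        raise ValueError("Response has no 'newsResults' list")
    return news


# Simulated (status code, body) pairs standing in for real responses
samples = [
    (200, '{"newsResults": [{"title": "Example"}]}'),
    (401, ""),
    (200, "{}"),
]
for status, body in samples:
    try:
        print(extract_news(status, body))
    except ValueError as err:
        print("Error:", err)
```

With a real request, the call would look like extract_news(response.status_code, response.text), and the returned list can be passed to pd.DataFrame as before.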

Scrape Google News Results using Selenium

The next option for scraping Google News is to use Python libraries that control a browser. Driving a real browser, optionally in headless mode, mimics the behavior of a real user and reduces the risk of blocking.

We will use Selenium to build the Google News scraper because it works with different programming languages and supports several web drivers. In this tutorial, we will use the Chrome web driver.

Full Code: Google News Scraping via Selenium

As in the previous example, let’s start by looking at a ready-made script that you can simply copy and run on your PC:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

chromedriver_path = 'C://chromedriver.exe'
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)

url = 'https://www.google.com/search?q=new+york+good+news&tbm=nws'
driver.get(url)

news_results = driver.find_elements(By.CSS_SELECTOR, 'div#rso > div > div > div > div')
for news_div in news_results:
    news_item = []
    try:
        news_link = news_div.find_element(By.TAG_NAME, 'a').get_attribute('href')
        print("Link:", news_link)

        divs_inside_news = news_div.find_elements(By.CSS_SELECTOR, 'a > div > div > div')

        for new in divs_inside_news:
            news_item.append(new.text)
        print("Domain:", news_item[1])
        print("Title:", news_item[2])
        print("Description:", news_item[3])
        print("Date:", news_item[4])
        print("-"*50+"\n\n"+"-"*50)
    except Exception as e:
        print("No Elems")

driver.quit()

For those who want to dive into the details, we’ve prepared the next section where we’ll break down the process of creating this script step by step.

Install the Library and Download Webdriver

To install Selenium, type at the command prompt:

pip install selenium

Then go to the Chrome webdriver website and download the version you need (it should match the version of Google Chrome you have installed).

Research Google News Page Structure

Before writing the code, look at the Google News page and research the parts we will scrape. The first thing to look at is the link to the Google News page. Let’s go over and see what it looks like:

https://www.google.com/search?q=new+york+good+news&tbm=nws

As we can see, we can easily compose scraping links by replacing “new york good news” with any other query.
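A minimal sketch of composing such links in code, using the standard library to encode the query. The build_news_url helper is a hypothetical name, and the start offset for later pages is an assumption inferred from the pagination links in the sample API response earlier (start=10 for page 2), so treat it as a guess rather than documented behavior.

```python
from urllib.parse import urlencode


def build_news_url(query, page=1, results_per_page=10):
    """Build a Google News search URL for the given query and page number."""
    params = {"q": query, "tbm": "nws"}  # tbm=nws selects the news results
    if page > 1:
        # Assumption: later pages are addressed by a 'start' offset
        params["start"] = (page - 1) * results_per_page
    # urlencode escapes special characters and turns spaces into '+'
    return "https://www.google.com/search?" + urlencode(params)


print(build_news_url("new york good news"))
print(build_news_url("new york good news", page=2))
```

Letting urlencode do the escaping avoids broken links when a query contains characters such as & or #.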

Now let’s go to the developer tools (F12 or right-click on the screen and Inspect) and look at one of the results in more detail.

Items

All news items sit inside a div tag with id=“rso”. We can use this and the HTML page structure to get the needed data. To get the elements themselves, we can use the selector “div#rso > div > div > div > div”, which matches the div tags that wrap each news item.

In another situation, we would get the data from the elements using classes. This could be the “SoaBEf” class, which is common to all elements. However, class names in Google News change often and are not constant. Therefore, let’s rely on the structure and elements that will not change.

Tags

Here, as we can see, we can get the following data:

  1. The link to the news.

  2. The name of the resource where the news is posted.

  3. The headline of the news.

  4. A description of the news.

  5. How long ago the news was published.

Now that we know what data we need, let’s move on to scraping.

Import Library and Set Parameters

Create a new file with the .py extension and import the necessary Selenium library modules:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

Now let’s set the path to the previously downloaded web driver file and the link to the Google News page to be scraped.

chromedriver_path = 'C://chromedriver.exe'
url = 'https://www.google.com/search?q=new+york+good+news&tbm=nws'

We also need to specify the parameters of the webdriver to run.

service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)

This concludes the preparation, and you can move on to data collection.

Go to Google News and Scrape Data

All we have to do is run the query and collect the data. To do this, run webdriver:

driver.get(url)

If you run the script now, a Google Chrome window will launch and navigate to the search query.

Selenium WebDriver

Now let’s parse the content of the page we are on. For this purpose, we have previously investigated the web page and studied its structure. Now let’s use it and get all the news on the page:

news_results = driver.find_elements(By.CSS_SELECTOR, 'div#rso > div > div > div > div')

Then we loop over the elements one by one:

for news_div in news_results:

First, let’s collect the links and display them on the screen:

        news_link = news_div.find_element(By.TAG_NAME, 'a').get_attribute('href')
        print("Link:", news_link)

Then we get the remaining elements:

        divs_inside_news = news_div.find_elements(By.CSS_SELECTOR, 'a > div > div > div')
        news_item = []
        for new in divs_inside_news:
            news_item.append(new.text)

Now let’s display these values on the screen:

        print("Domain:", news_item[1])
        print("Title:", news_item[2])
        print("Description:", news_item[3])
        print("Date:", news_item[4])

Make a separator between the different news items so it’s visually apparent:

        print("-"*50+"\n\n"+"-"*50)

And finally, close the web driver.

driver.quit()

Now, if we run this script, we will get the data in the form:

Result

Even though we got the same information, let’s refine the script a bit further so that it doesn’t just display the data on the screen, but actually saves it to a file.

Export the Data to CSV

Actually, the process of saving the data will be the same as in the previous example. For this, we’ll need the pandas library again:

import pandas as pd

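Unlike the API example, there is no JSON response here, so we have to assemble the rows ourselves. Create an empty list before the loop (all_news = []), and inside the loop, after news_item is filled, append one dictionary per news item:

        all_news.append({'link': news_link, 'source': news_item[1], 'title': news_item[2], 'snippet': news_item[3], 'date': news_item[4]})

After the loop, build the data frame from that list rather than from news_item, which only holds the text blocks of the last item processed:

df = pd.DataFrame(all_news)

The column headings will be identical to the dictionary keys. Now let’s just save the data frame to a file: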

df.to_csv("news_result.csv", index=False)
df.to_excel("news_result.xlsx", index=False)

As a result, we’ll end up with a file structured much like the one in the previous example.
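The script reads news_item[1] through news_item[4] directly, which raises IndexError whenever a result has fewer text blocks than expected. A minimal sketch of a more tolerant mapping follows; to_record and the field names are assumptions chosen to mirror the API output, and the scraped list stands in for the values Selenium would return.

```python
# Field names for the text blocks at positions 1..4 (position 0 is skipped,
# as in the script above)
FIELDS = ("source", "title", "snippet", "date")


def to_record(link, texts):
    """Map a link and a list of text blocks to a dictionary with named fields."""
    record = {"link": link}
    for offset, name in enumerate(FIELDS, start=1):
        # Fall back to an empty string when the block is missing
        record[name] = texts[offset] if offset < len(texts) else ""
    return record


# Stand-ins for (news_link, news_item) pairs collected in the loop
scraped = [
    ("https://example.com/a",
     ["", "Example Times", "Good news in New York", "A short description...", "2 hours ago"]),
    ("https://example.com/b", ["", "Example Post", "Another headline"]),
]

rows = [to_record(link, texts) for link, texts in scraped]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to pd.DataFrame and saved with to_csv or to_excel in the same way as the API results.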

Conclusion and Takeaways

This article discussed two ways to scrape Google News data using Python: using the Google News API and applying web scraping methods using the Selenium library. The Google News API offers a simple approach, providing data in JSON format that can be quickly processed and analyzed. Obtaining the API key and setting the parameters lets you quickly retrieve news information according to your requests.
