Web Scraping Google News using Python: Step-by-Step Guide

Last edit: Apr 30, 2024

Discover how to leverage Google News and Python for an array of advanced applications with our comprehensive guide. Whether for market research, sentiment analysis, or crisis management, these easy-to-implement techniques can help transform your approach to news gathering.

We'll provide detailed instructions on using the Google SERP API and web scraping libraries such as Beautiful Soup and Selenium for automated information gathering. These methods allow you to explore more advanced use cases beyond catching up with today’s headlines. Discover an easier way to interact with news today!


Google News Scraping using API

There are two ways to extract news from Google search results: using a Python library for web scraping, or using the Google News API. The API option is a great choice for beginners and anyone who wants to avoid the hassle of dealing with blocking, captchas, and proxy rotation.

The Google News API gives you data in JSON format, which is easy to process and work with. Let's see how to scrape Google News headlines and descriptions using the Google News API, what you need, and how to save the obtained data in Excel.

Sign Up and Get an API key

To use the API, you need an API key. To get it, go to the HasData website and sign up.

Dashboard

Go to the Dashboard tab in your account and copy your personal API key. We will need it later.

Set the Parameters

First, let's install the necessary libraries. To do this, specify the following in the command prompt:

pip install requests
pip install pandas

The Requests library lets us send requests to the API and retrieve the necessary data. The Pandas library is needed to process that data and then save it as an Excel file.

Now that the libraries are installed, create a file with the *.py extension and import them.

import requests
import pandas as pd

Now let's set the parameters and put them into variables. There are only two of them: the API endpoint URL and the keyword.

keyword = 'new york good news'
api_url = 'https://api.hasdata.com/scrape/google'

The last thing to set is the headers and body of the request. The header contains only one parameter - the API key. But the request body can contain many parameters, including localization parameters. The full list of parameters can be found in our documentation.

In this example, we will use only the necessary parameters:

headers = {'x-api-key': 'YOUR-API-KEY'}

params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'nws'
}

We specified the keyword, domain, and type. The remaining parameters can be left unspecified, but they can be used to fine-tune the query and get more specific results.
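For instance, the request body could also carry localization fields. Assuming the API accepts `gl` (country) and `hl` (language) parameters, as many SERP APIs do (check the HasData documentation for the exact names), a more targeted query might look like this sketch, which also previews the resulting query string using only the standard library:

```python
from urllib.parse import urlencode

# Base parameters from the tutorial plus hypothetical localization
# fields (gl/hl); verify the exact names in the API documentation.
params = {
    'q': 'new york good news',
    'domain': 'google.com',
    'tbm': 'nws',   # news vertical
    'gl': 'us',     # assumed: country of the search
    'hl': 'en',     # assumed: interface language
}

query_string = urlencode(params)
print(query_string)
```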

Make a Request

Now that all the necessary parameters are specified, execute the request:

response = requests.get(api_url, params=params, headers=headers)

HasData's Google News API uses a GET request and provides a JSON response in the following format:

{
  "requestMetadata": {
    "id": "57239e2b-02a2-4bfb-878d-9c36f5c21798",
    "googleUrl": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIaQXVzdGluLFRleGFzLFVuaXRlZCBTdGF0ZXM%3D&gl=us&hl=en&filter=1&tbm=nws&oq=Coffee&sourceid=chrome&num=10&ie=UTF-8",
    "googleHtmlFile": "https://storage.googleapis.com/scrapeit-cloud-screenshots/57239e2b-02a2-4bfb-878d-9c36f5c21798.html",
    "status": "ok"
  },
  "pagination": {
    "next": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&ei=sim9ZPe7Noit5NoP3_efgAU&start=10&sa=N&ved=2ahUKEwj33Jes9qSAAxWIFlkFHd_7B1AQ8NMDegQIAhAW",
    "current": 1,
    "pages": [
      {
        "2": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&ei=sim9ZPe7Noit5NoP3_efgAU&start=10&sa=N&ved=2ahUKEwj33Jes9qSAAxWIFlkFHd_7B1AQ8tMDegQIAhAE"
      },
      // ... More pages ...
    ]
  },
  "searchInformation": {
    "totalResults": "37600000",
    "timeTaken": 0.47
  },
  "newsResults": [
    {
      "position": 1,
      "title": "De'Longhi's TrueBrew Coffee Maker Boasts Simplicity, but the Joe Is Just So-So",
      "link": "https://www.wired.com/review/delonghi-truebrew-drip-coffee-maker/",
      "source": "WIRED",
      "snippet": "The expensive coffee maker with Brad Pitt as its spokesmodel is better than a capsule-based machine but not as good as competing single-cup...",
      "date": "1 day ago"
    },
    // ... More news results ...
  ]
}

You can display the obtained data on the screen or continue working with it.
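To get a feel for the shape of this response before calling the live API, you can parse a trimmed-down sample with the standard `json` module (the values are copied from the example above):

```python
import json

# A trimmed sample matching the response structure shown above.
sample = '''{
  "requestMetadata": {"status": "ok"},
  "searchInformation": {"totalResults": "37600000", "timeTaken": 0.47},
  "newsResults": [
    {
      "position": 1,
      "title": "De'Longhi's TrueBrew Coffee Maker Boasts Simplicity...",
      "link": "https://www.wired.com/review/delonghi-truebrew-drip-coffee-maker/",
      "source": "WIRED",
      "snippet": "The expensive coffee maker...",
      "date": "1 day ago"
    }
  ]
}'''

data = json.loads(sample)
for item in data['newsResults']:
    print(item['position'], item['source'], '-', item['title'])
```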

Parse the Data

To process the data further, we need to parse it. Since the API returns JSON, we decode the response body with the built-in JSON parser:

data = response.json()

Now we can use the attribute names to retrieve specific data:

news = data['newsResults']

Thus, we have put all the news into the news variable.

Save the Gathered Data

To save the obtained data as an Excel file, we use Pandas. With this library, we can create a dataframe, an organized tabular data set, from the JSON response.

df = pd.DataFrame(news)

The headings will be identical to the attribute names. Now let's just save the dataframe to a file:

df.to_excel("news_result.xlsx", index=False)

The result is a table like this:

Excel File
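Note that `to_excel` relies on an Excel writer engine such as `openpyxl` being installed. If adding that dependency is undesirable, writing a CSV file with the standard library is a dependency-free alternative (a sketch; the field names mirror the API's `newsResults` attributes, and the record below is illustrative):

```python
import csv

# Sample records with the same fields as the API's newsResults.
news = [
    {'position': 1, 'title': 'Example headline', 'link': 'https://example.com',
     'source': 'Example Source', 'snippet': 'Example snippet...', 'date': '1 day ago'},
]

with open('news_result.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=list(news[0].keys()))
    writer.writeheader()   # column headers match the attribute names
    writer.writerows(news)
```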

To make the code more reliable, let's add try..except blocks and check for a successful response. Resulting code:

import requests
import pandas as pd

keyword = 'new york good news'
api_url = 'https://api.hasdata.com/scrape/google'
headers = {'x-api-key': 'YOUR-API-KEY'}
params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'nws'
}

try:
    response = requests.get(api_url, params=params, headers=headers)
    if response.status_code == 200:
        data = response.json()
        news = data['newsResults']
        df = pd.DataFrame(news)
        df.to_excel("news_result.xlsx", index=False)
    else:
        print('Request failed with status code:', response.status_code)
except Exception as e:
    print('Error:', e)

Thus, we got the data without the need to process HTML pages, use proxies or search for ways to bypass blocking and captchas.

Scrape Google News Results using Selenium

The next option for scraping Google News is to use Python libraries directly. In this case, it is worth using a headless browser to mimic the behavior of a real user and reduce the risk of blocking.

We will use Selenium to build a Google News scraper because it works with different programming languages and supports several web drivers. In this tutorial, we will use the Chrome web driver.


Install the Library and Download Webdriver

To install Selenium, type at the command prompt:

pip install selenium

Then go to the Chrome webdriver website and download the version you need (it should match the version of Google Chrome you have installed).

Research Google News Page Structure

Before writing the code, look at the Google News page and research the parts we will scrape. The first thing to look at is the link to the Google News page. Let's go over and see what it looks like.

Google News Link

As we can see, we can easily compose scraping links by replacing "new york good news" with any other query.
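For arbitrary queries, that link can be assembled with the standard library's `quote_plus`, which URL-encodes spaces and special characters (a small sketch of the URL pattern shown above):

```python
from urllib.parse import quote_plus

def google_news_url(query: str) -> str:
    # tbm=nws switches the Google search to the News vertical
    return f'https://www.google.com/search?q={quote_plus(query)}&tbm=nws'

print(google_news_url('new york good news'))
```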

Now let's go to the developer tools (F12 or right-click on the screen and Inspect) and look at one of the results in more detail.

Items

All the news results sit inside a div tag with id="rso". We can use this and the HTML page structure to get the needed data. To get the elements themselves, we can use the selector "div#rso > div > div > div > div > div > div", which selects the nested div tags holding the data.

In another situation, we would get the data from the elements using classes. This could be the "SoaBEf" class, which is common to all elements. However, class names in Google News change often and are not constant. Therefore, let's rely on the structure and elements that will not change.
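The structure-over-classes idea can be illustrated on a tiny, well-formed fragment (purely illustrative markup, far simpler than Google's actual HTML), using the standard library's `ElementTree` to select elements by their position in the tree rather than by class name:

```python
import xml.etree.ElementTree as ET

# Illustrative markup only -- real Google News HTML is more complex.
sample = (
    '<div id="rso">'
    '<div><a href="https://example.com/story">'
    '<div><div>Example Source</div><div>Example Headline</div></div>'
    '</a></div>'
    '</div>'
)

root = ET.fromstring(sample)
# Select by position in the tree, not by (unstable) class names.
for link in root.findall('./div/a'):
    texts = [d.text for d in link.findall('./div/div')]
    print(link.get('href'), texts)
```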

Tags

Here, as we can see, we can get the following data:

  1. The link to the news.
  2. The name of the resource where the news is posted.
  3. The headline of the news.
  4. A description of the news.
  5. How long ago the news was published.

Now that we know what data we need let's move on to scraping.

Import Library and Set Parameters

Create a new file with the *.py extension and import the necessary Selenium modules:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

Now let's set the path to the previously downloaded web driver file and the link to the Google news page to be scraped.

chromedriver_path = 'C://chromedriver.exe'
url = 'https://www.google.com/search?q=new+york+good+news&tbm=nws'

We also need to specify the parameters of the webdriver to run.

service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)

This concludes the preparation, and you can move on to data collection.

Go to the Google News and Scrape Data

All we have to do is run the query and collect the data. To do this, run webdriver:

driver.get(url)

If you run the script now, a Google Chrome window will launch and navigate to the search query.

Selenium WebDriver

Now let's parse the content of the page we are on. For this purpose, we have previously investigated the web page and studied its structure. Now let's use it and get all the news on the page:

news_results = driver.find_elements(By.CSS_SELECTOR, 'div#rso > div > div > div > div')

Then we go around each element one by one:

for news_div in news_results:

First, let's collect the links and display them on the screen:

        news_link = news_div.find_element(By.TAG_NAME, 'a').get_attribute('href')
        print("Link:", news_link)

Then we get the remaining elements:

        divs_inside_news = news_div.find_elements(By.CSS_SELECTOR, 'a > div > div > div')
        news_item = []
        for new in divs_inside_news:
            news_item.append(new.text)

Now let's display these values on the screen:

        print("Domain:", news_item[1])
        print("Title:", news_item[2])
        print("Description:", news_item[3])
        print("Date:", news_item[4])

Make a separator between the different news items so it's visually apparent:

        print("-"*50+"\n\n"+"-"*50)

And finally, close the web driver.

driver.quit()

Now, if we run this script, we will get the data in the form:

Result

Full code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

chromedriver_path = 'C://chromedriver.exe'
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)

url = 'https://www.google.com/search?q=new+york+good+news&tbm=nws'
driver.get(url)

news_results = driver.find_elements(By.CSS_SELECTOR, 'div#rso > div > div > div > div')
for news_div in news_results:
    news_item = []
    try:
        news_link = news_div.find_element(By.TAG_NAME, 'a').get_attribute('href')
        print("Link:", news_link)

        divs_inside_news = news_div.find_elements(By.CSS_SELECTOR, 'a > div > div > div')

        for new in divs_inside_news:
            news_item.append(new.text)
        print("Domain:", news_item[1])
        print("Title:", news_item[2])
        print("Description:", news_item[3])
        print("Date:", news_item[4])
        print("-"*50+"\n\n"+"-"*50)
    except Exception as e:
        print("No elements:", e)

driver.quit()

If you want to save the data in Excel format, you can write it row by row or, as in the previous option, create a dataframe and save it all at once.
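Writing row by row can be done with the standard csv module while the loop runs, so partial results survive even if the browser crashes mid-run (a sketch; the row below is illustrative, and the field order matches the print statements above):

```python
import csv

# Hypothetical rows in the order printed above: link, domain, title,
# description, date. In the real script, write one row per news item
# inside the scraping loop.
rows = [
    ['https://example.com/story', 'example.com', 'Example Headline',
     'Example description...', '1 day ago'],
]

with open('news_selenium.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['link', 'domain', 'title', 'description', 'date'])
    for row in rows:
        writer.writerow(row)   # write each item as soon as it is scraped
```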

Conclusion and Takeaways

This article discussed two ways to scrape Google News data using Python: using the Google News API and applying web scraping methods using the Selenium library. The Google News API offers a simple approach, providing data in JSON format that can be quickly processed and analyzed. Obtaining the API key and setting the parameters lets you quickly retrieve news information according to your requests.

For those who need more control and flexibility, web scraping with Selenium can be an alternative. By mimicking a user's behavior and interaction with a web page, Selenium allows you to extract specific data elements. This method is appropriate when more complex interactions with a web page are required, such as filling in fields.

The article described a step-by-step process for both methods and provided code samples showing how to use each method to retrieve Google News data. By following the instructions and code snippets, you can get a clear idea of how to collect news data from Google News for machine learning or your analytical and research needs.

Valentina Skakun

I'm a technical writer who believes that data parsing can help with collecting and analyzing data. I write about what parsing is and how to use it.