Web Scraping Google News using Python: Step-by-Step Guide
Google News is the biggest news aggregator out there, available in 141 countries and offering content in 41 languages. By 2021, it had over 1.5 billion users, making it one of the go-to platforms for keeping up with the latest news or doing market research.
In this article, we’ll show you how to build your own Google News scraper that automatically collects fresh news and helps you track topics you’re interested in. We’ll cover two approaches: the Google SERP API and browser automation with Selenium. Both make automated data collection straightforward, and they are useful for more than just staying updated on the news.
How to Scrape Google News using API
There are two ways to extract news from Google search results: using a Python library for web scraping, or using the Google News API. The API option is a great choice for beginners and anyone who wants to avoid the hassle of dealing with blocking, CAPTCHAs, and proxy rotation.
The Google News API returns data in JSON format, which is easy to process and work with. Let’s see how to scrape Google News headlines and descriptions using the Google News API, what you need, and how to save the obtained data in Excel.
Sign Up and Get an API key
To use the API, you need an API key. To get it, go to the HasData website and sign up.
Go to the Dashboard tab in your account and copy your personal API key. We will need it later.
Full Code: Google News Scraping with API
Before we dive into the step-by-step process of creating the script, here’s a ready-made example for those who just want the end result right away:
import requests
import pandas as pd

keyword = 'new york good news'
api_url = 'https://api.hasdata.com/scrape/google'
headers = {'x-api-key': 'YOUR-API-KEY'}
params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'nws'
}

try:
    response = requests.get(api_url, params=params, headers=headers)
    if response.status_code == 200:
        data = response.json()
        news = data['newsResults']
        df = pd.DataFrame(news)
        df.to_excel("news_result.xlsx", index=False)
        df.to_csv("news_result.csv", index=False)
except Exception as e:
    print('Error:', e)
If you want to test the script right away without going through how it was built, open it in Google Colaboratory, enter your API key and keyword, and run it. Otherwise, continue to the next section, where I’ll explain each step in detail.
Set the Parameters
First, let’s install the necessary libraries. To do this, run the following in the command prompt:
pip install requests
pip install pandas
The Requests library lets us send HTTP requests to the API and receive the data. Pandas is needed to process the data and save it as a CSV or Excel file. Note that writing .xlsx files with Pandas also requires the openpyxl package (pip install openpyxl).
Now that the libraries are installed, create a file with a .py extension and import them.
import requests
import pandas as pd
Now let’s put the values we may want to change into variables. There are only two of them: the API endpoint URL and a keyword.
keyword = 'new york good news'
api_url = 'https://api.hasdata.com/scrape/google'
The last thing to set is the request headers and query parameters. The header contains only one parameter: the API key. The query can contain many parameters, including localization parameters. The full list of parameters can be found in our documentation.
In this example, we will use only the necessary parameters:
headers = {'x-api-key': 'YOUR-API-KEY'}
params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'nws'
}
We specified the keyword, the domain, and the search type (tbm=nws selects news results). The remaining parameters can be left unspecified, but they can be used to fine-tune the query and get more specific results.
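As a rough sketch of how optional settings could be layered on top of the required ones, the helper below merges extra keyword arguments into the base dictionary. build_params is a hypothetical helper, not part of the API. The names gl, hl, and num are taken from the Google URL in the sample response shown later in this section; whether the API accepts them under the same names is an assumption, so check the documentation before relying on them.

```python
def build_params(keyword, **extra):
    """Return query parameters for a Google News search.

    build_params is a hypothetical helper. Optional keys passed via
    **extra (for example gl, hl, num) are assumptions: confirm the
    exact names in the API documentation.
    """
    params = {
        'q': keyword,
        'domain': 'google.com',
        'tbm': 'nws',  # 'nws' selects the News tab
    }
    # Skip options that were left as None so they are not sent at all.
    params.update({k: v for k, v in extra.items() if v is not None})
    return params


params = build_params('new york good news', gl='us', hl='en', num=None)
print(params)
```

Keeping the required keys in one place means a typo in an optional setting cannot accidentally drop q or tbm from the request.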
Make a Request
Now that all the necessary parameters are specified, execute the request:
response = requests.get(api_url, params=params, headers=headers)
HasData’s Google News API uses a GET request and provides a JSON response in the following format:
{
  "requestMetadata": {
    "id": "57239e2b-02a2-4bfb-878d-9c36f5c21798",
    "googleUrl": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIaQXVzdGluLFRleGFzLFVuaXRlZCBTdGF0ZXM%3D&gl=us&hl=en&filter=1&tbm=nws&oq=Coffee&sourceid=chrome&num=10&ie=UTF-8",
    "googleHtmlFile": "https://storage.googleapis.com/scrapeit-cloud-screenshots/57239e2b-02a2-4bfb-878d-9c36f5c21798.html",
    "status": "ok"
  },
  "pagination": {
    "next": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&ei=sim9ZPe7Noit5NoP3_efgAU&start=10&sa=N&ved=2ahUKEwj33Jes9qSAAxWIFlkFHd_7B1AQ8NMDegQIAhAW",
    "current": 1,
    "pages": [
      {
        "2": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&ei=sim9ZPe7Noit5NoP3_efgAU&start=10&sa=N&ved=2ahUKEwj33Jes9qSAAxWIFlkFHd_7B1AQ8tMDegQIAhAE"
      },
      // ... More pages ...
    ]
  },
  "searchInformation": {
    "totalResults": "37600000",
    "timeTaken": 0.47
  },
  "newsResults": [
    {
      "position": 1,
      "title": "De'Longhi's TrueBrew Coffee Maker Boasts Simplicity, but the Joe Is Just So-So",
      "link": "https://www.wired.com/review/delonghi-truebrew-drip-coffee-maker/",
      "source": "WIRED",
      "snippet": "The expensive coffee maker with Brad Pitt as its spokesmodel is better than a capsule-based machine but not as good as competing single-cup...",
      "date": "1 day ago"
    },
    // ... More news results ...
  ]
}
You can display the obtained data on the screen or continue working with it.
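The pagination block in the response above can be read with the standard library alone. The sketch below uses a trimmed, hand-written sample in the same shape (the next URL is shortened for readability) and pulls out the start offset of the following page. next_start is a hypothetical helper; how that offset is passed back to the API depends on its documented parameters and is not shown here.

```python
from urllib.parse import urlparse, parse_qs

# Trimmed sample in the shape shown above; real responses carry more fields.
sample = {
    "pagination": {
        "next": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&start=10",
        "current": 1,
    },
    "searchInformation": {"totalResults": "37600000", "timeTaken": 0.47},
}


def next_start(data):
    """Return the 'start' offset of the next results page, or None."""
    next_url = data.get("pagination", {}).get("next")
    if not next_url:
        return None
    query = parse_qs(urlparse(next_url).query)
    values = query.get("start")
    return int(values[0]) if values else None


# totalResults arrives as a string, so convert it before doing arithmetic.
print("Total results:", int(sample["searchInformation"]["totalResults"]))
print("Next page starts at:", next_start(sample))
```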
Parse the Data
To process the data further, we need to parse it. The json() method converts the JSON response body into a Python dictionary:
data = response.json()
Now we can use the key names to retrieve specific data:
news = data['newsResults']
Thus, we have put all the news into the news variable.
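Indexing with data['newsResults'] raises a KeyError when the key is missing, for example when a query returns no news. A more defensive variant might look like the sketch below. extract_news is a hypothetical helper, and dropping items without a title or link is an assumption about what counts as a usable result.

```python
def extract_news(data):
    """Return the list of news items, or an empty list if the key is absent."""
    news = data.get('newsResults') or []
    # Keep only dictionary entries that carry both a title and a link.
    return [item for item in news
            if isinstance(item, dict) and item.get('title') and item.get('link')]


sample = {
    'newsResults': [
        {'position': 1, 'title': 'Example headline', 'link': 'https://example.com/a'},
        {'position': 2, 'title': '', 'link': 'https://example.com/b'},
    ]
}
for item in extract_news(sample):
    print(item['position'], item['title'], item['link'])
```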
Export the Data to CSV
To save the obtained data as a CSV or Excel file, we use Pandas. With this library, we can turn the list of news items from the JSON response into a data frame, that is, a data set organized as a table.
df = pd.DataFrame(news)
The column headings will match the key names of the news items. Now let’s save the data frame to a file:
df.to_csv("news_result.csv", index=False)
df.to_excel("news_result.xlsx", index=False)
The result is a table with one row per news item and one column per key, such as position, title, link, source, snippet, and date.
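If installing Pandas is not an option, the standard library csv module can produce a similar CSV file. This is a swapped-in alternative rather than the method used in this article. news_to_csv and the FIELDS list are hypothetical names, and the field order simply follows the sample response.

```python
import csv
import io

FIELDS = ['position', 'title', 'link', 'source', 'snippet', 'date']


def news_to_csv(news, fields=FIELDS):
    """Serialize a list of news dictionaries to CSV text."""
    buffer = io.StringIO()
    # extrasaction='ignore' drops keys outside `fields`; restval fills gaps.
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            extrasaction='ignore', restval='')
    writer.writeheader()
    writer.writerows(news)
    return buffer.getvalue()


sample_news = [{'position': 1, 'title': 'Example headline',
                'link': 'https://example.com/a', 'source': 'Example',
                'snippet': 'Short text', 'date': '1 day ago'}]
csv_text = news_to_csv(sample_news)
print(csv_text)

# To write a file instead of printing:
# with open('news_result.csv', 'w', newline='', encoding='utf-8') as f:
#     f.write(csv_text)
```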
To make the code more reliable, let’s add try..except blocks and check for a successful response:
try:
    response = requests.get(api_url, params=params, headers=headers)
    if response.status_code == 200:
        data = response.json()
        news = data['newsResults']
        df = pd.DataFrame(news)
        df.to_excel("news_result.xlsx", index=False)
except Exception as e:
    print('Error:', e)
Thus, we got the data without the need to process HTML pages, use proxies or search for ways to bypass blocking and captchas.
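One gap in the status check above is that a non-200 response is skipped silently, so a wrong API key looks the same as an empty result. A possible refinement, sketched below, turns those cases into explicit errors. news_from_response is a hypothetical helper, and FakeResponse is only a stand-in that lets the example run without a network call; in a real script you would pass in the value returned by requests.get.

```python
def news_from_response(response):
    """Return news items from a response, raising on any failure.

    `response` is any object with `status_code` and `json()`, such as the
    value returned by requests.get.
    """
    if response.status_code != 200:
        raise RuntimeError(f'Request failed with status {response.status_code}')
    data = response.json()
    if 'newsResults' not in data:
        raise KeyError("Response has no 'newsResults' key")
    return data['newsResults']


class FakeResponse:
    """Stand-in for a requests response, used here to avoid a network call."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


ok = FakeResponse(200, {'newsResults': [{'title': 'Example headline'}]})
print(news_from_response(ok))

try:
    news_from_response(FakeResponse(403, {}))
except RuntimeError as error:
    print('Error:', error)
```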
Scrape Google News Results using Selenium
The next option for scraping Google News is to use Python libraries. In this case, it is worth using headless browsers to mimic the behavior of a real user to reduce the risk of blocking.
We will use Selenium to build the Google News scraper because it works with different programming languages and supports several web drivers. In this tutorial, we will use the Chrome web driver.
Full Code: Google News Scraping via Selenium
As in the previous example, let’s start by looking at a ready-made script that you can simply copy and run on your PC:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

chromedriver_path = 'C://chromedriver.exe'
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)

url = 'https://www.google.com/search?q=new+york+good+news&tbm=nws'
driver.get(url)

news_results = driver.find_elements(By.CSS_SELECTOR, 'div#rso > div > div > div > div')
for news_div in news_results:
    news_item = []
    try:
        news_link = news_div.find_element(By.TAG_NAME, 'a').get_attribute('href')
        print("Link:", news_link)
        divs_inside_news = news_div.find_elements(By.CSS_SELECTOR, 'a > div > div > div')
        for new in divs_inside_news:
            news_item.append(new.text)
        print("Domain:", news_item[1])
        print("Title:", news_item[2])
        print("Description:", news_item[3])
        print("Date:", news_item[4])
        print("-"*50+"\n\n"+"-"*50)
    except Exception as e:
        print("No Elems")

driver.quit()
For those who want to dive into the details, we’ve prepared the next section where we’ll break down the process of creating this script step by step.
Install the Library and Download Webdriver
To install Selenium, type at the command prompt:
pip install selenium
Then go to the Chrome webdriver website and download the version you need (it should match the version of Google Chrome you have installed).
Research Google News Page Structure
Before writing the code, look at the Google News page and research the parts we will scrape. The first thing to look at is the link to the Google News page. Let’s go over and see what it looks like:
https://www.google.com/search?q=new+york+good+news&tbm=nws
As we can see, we can easily compose scraping links by replacing “new york good news” with any other query.
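Composing the link by hand works for simple phrases, but queries containing characters such as & or # need escaping. A small sketch using the standard library is shown below; build_news_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_news_url(query):
    """Return a Google search URL restricted to the News tab."""
    # urlencode escapes special characters and turns spaces into '+'.
    return 'https://www.google.com/search?' + urlencode({'q': query, 'tbm': 'nws'})


print(build_news_url('new york good news'))
# → https://www.google.com/search?q=new+york+good+news&tbm=nws
```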
Now let’s go to the developer tools (F12 or right-click on the screen and Inspect) and look at one of the results in more detail.
All news results sit inside a div tag with id=“rso”. We can use this and the HTML page structure to get the needed data. To get the news blocks themselves, we can use the selector “div#rso > div > div > div > div”, which matches the div that wraps each news item.
In another situation, we would get the data from the elements using classes. This could be the “SoaBEf” class, which is common to all elements. However, class names in Google News change often and are not constant. Therefore, let’s rely on the structure and elements that will not change.
Here, as we can see, we can get the following data:
- The link to the news.
- The name of the resource where the news is posted.
- The headline of the news.
- A description of the news.
- How long ago the news was published.
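The five fields listed above can be modeled as a small record type, which keeps the later scraping code from depending on bare list positions. This is only a sketch: NewsItem and to_news_item are hypothetical names, and the index mapping (1 for the source, 2 for the title, 3 for the description, 4 for the date) is an assumption taken from the layout the script in this tutorial relies on, which Google may change.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NewsItem:
    link: str
    source: str
    title: str
    description: str
    published: str


def to_news_item(link: str, texts: List[str]) -> Optional[NewsItem]:
    """Build a NewsItem from a link and the text of the nested div elements.

    Assumes index 1 is the source, 2 the title, 3 the description and
    4 the date. Returns None when the block has fewer entries than expected.
    """
    if not link or len(texts) < 5:
        return None
    return NewsItem(link, texts[1], texts[2], texts[3], texts[4])


item = to_news_item(
    'https://example.com/a',
    ['', 'Example Source', 'Example headline', 'Short text', '1 day ago'],
)
print(item)
```

Returning None for short blocks gives the caller a way to skip non-news elements that happen to match the selector, instead of relying on an IndexError.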
Now that we know what data we need, let’s move on to scraping.
Import Library and Set Parameters
Create a new file with a .py extension and import the necessary Selenium library modules:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
Now let’s set the path to the previously downloaded web driver file and the link to the Google News page to be scraped.
chromedriver_path = 'C://chromedriver.exe'
url = 'https://www.google.com/search?q=new+york+good+news&tbm=nws'
We also need to specify the parameters of the webdriver to run.
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)
This concludes the preparation, and you can move on to data collection.
Go to Google News and Scrape Data
All we have to do is run the query and collect the data. To do this, run webdriver:
driver.get(url)
If you run the script now, a Google Chrome window will launch and navigate to the search results for the query.
Now let’s parse the content of the page we are on. For this purpose, we have previously investigated the web page and studied its structure. Now let’s use it and get all the news on the page:
news_results = driver.find_elements(By.CSS_SELECTOR, 'div#rso > div > div > div > div')
Then we iterate over the elements one by one:
for news_div in news_results:
First, let’s collect the links and display them on the screen:
    news_link = news_div.find_element(By.TAG_NAME, 'a').get_attribute('href')
    print("Link:", news_link)
Then we get the remaining elements:
    divs_inside_news = news_div.find_elements(By.CSS_SELECTOR, 'a > div > div > div')
    news_item = []
    for new in divs_inside_news:
        news_item.append(new.text)
Now let’s display these values on the screen:
    print("Domain:", news_item[1])
    print("Title:", news_item[2])
    print("Description:", news_item[3])
    print("Date:", news_item[4])
Make a separator between the different news items so it’s visually apparent:
    print("-"*50+"\n\n"+"-"*50)
And finally, close the web driver.
driver.quit()
Now, if we run this script, it prints the link, domain, title, description, and date of each news item, with dashed lines separating the items.
Even though we got the same information, let’s refine the script a bit further so that it doesn’t just display the data on the screen, but actually saves it to a file.
Export the Data to CSV
Actually, the process of saving the data will be the same as in the previous example. For this, we’ll need the pandas library again:
import pandas as pd
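As before, Pandas turns the collected data into a data frame. Note that news_item holds the fields of only one news item and is reset on every loop iteration, so we first need to gather all items into a single list. Create an empty list before the loop and append to it at the end of each iteration:
all_news = []  # before the for loop
all_news.append([news_link] + news_item[1:5])  # inside the loop, after the print calls
Then build the data frame from that list, naming the columns explicitly:
df = pd.DataFrame(all_news, columns=["Link", "Domain", "Title", "Description", "Date"])
The headings come from the columns argument. Now let’s save the data frame to a file: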
df.to_csv("news_result.csv", index=False)
df.to_excel("news_result.xlsx", index=False)
As a result, we’ll end up with a file similar to the one from the previous example, with one row per news item.
Conclusion and Takeaways
This article discussed two ways to scrape Google News data using Python: using the Google News API and applying web scraping methods using the Selenium library. The Google News API offers a simple approach, providing data in JSON format that can be quickly processed and analyzed. Obtaining the API key and setting the parameters lets you quickly retrieve news information according to your requests.