Web Scraping Google News using Python: Step-by-Step Guide
Discover how to leverage Google News and Python for an array of advanced applications with our comprehensive guide. Whether for market research, sentiment analysis, or crisis management, these easy-to-implement techniques can help transform your approach to news gathering.
We’ll provide detailed instructions on using the Google SERP API and web scraping libraries such as Beautiful Soup and Selenium for automated information gathering. These methods let you explore more advanced use cases than simply catching up with today’s headlines. Discover an easier way to interact with news today!
Google News Scraping using API
There are two ways to extract news from Google search results: using a Python library for web scraping, or using the Google News API. The API option is a great choice for beginners and anyone who wants to avoid the hassle of dealing with blocking, captchas, and proxy rotation.
The Google News API returns data in JSON format, which is easy to process and work with. Let’s see how to scrape Google News headlines and descriptions using the Google News API, what you need, and how to save the obtained data in Excel.
Sign Up and Get an API key
To use the API, you need an API key. To get it, go to the HasData website and sign up.
Go to the Dashboard tab in your account and copy your personal API key. We will need it later.
Set the Parameters
First, let’s install the necessary libraries. To do this, specify the following in the command prompt:
pip install requests
pip install pandas
The Requests library lets us make HTTP requests to the API to fetch the data we need, while the Pandas library is used to process the data and save it as an Excel file.
Now that the libraries are installed, create a file with a *.py extension and import them.
import requests
import pandas as pd
Now let’s set the parameters that can be put into variables. There are only two of them: the API endpoint URL and a keyword.
keyword = 'new york good news'
api_url = 'https://api.hasdata.com/scrape/google'
The last thing to set is the headers and body of the request. The header contains only one parameter - the API key. But the request body can contain many parameters, including localization parameters. The full list of parameters can be found in our documentation.
In this example, we will use only the necessary parameters:
headers = {'x-api-key': 'YOUR-API-KEY'}
params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'nws'
}
We specified the keyword, domain, and type. The remaining parameters can be left unspecified, but they can be used to fine-tune the query and get more specific results.
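For instance, localization parameters are a common way to fine-tune SERP-style queries. The sketch below adds country and language parameters; the names `gl` and `hl` are assumptions modeled on standard Google search parameters, so verify them against the HasData documentation before relying on them:

```python
keyword = 'new york good news'

# Hypothetical extended parameter set; 'gl' and 'hl' are assumed names,
# check the API documentation for the exact supported parameters.
params = {
    'q': keyword,           # search query (required)
    'domain': 'google.com', # Google domain to query
    'tbm': 'nws',           # "news" search type
    'gl': 'us',             # assumed: country to localize results to
    'hl': 'en',             # assumed: interface language
}

print(params['q'])
```

Unspecified parameters simply fall back to the API's defaults, so you can start minimal and add localization only when your results need it.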
Make a Request
Now that all the necessary parameters are specified, execute the request:
response = requests.get(api_url, params=params, headers=headers)
HasData’s Google News API uses a GET request and provides a JSON response in the following format:
{
  "requestMetadata": {
    "id": "57239e2b-02a2-4bfb-878d-9c36f5c21798",
    "googleUrl": "https://www.google.com/search?q=Coffee&uule=w+CAIQICIaQXVzdGluLFRleGFzLFVuaXRlZCBTdGF0ZXM%3D&gl=us&hl=en&filter=1&tbm=nws&oq=Coffee&sourceid=chrome&num=10&ie=UTF-8",
    "googleHtmlFile": "https://storage.googleapis.com/scrapeit-cloud-screenshots/57239e2b-02a2-4bfb-878d-9c36f5c21798.html",
    "status": "ok"
  },
  "pagination": {
    "next": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&ei=sim9ZPe7Noit5NoP3_efgAU&start=10&sa=N&ved=2ahUKEwj33Jes9qSAAxWIFlkFHd_7B1AQ8NMDegQIAhAW",
    "current": 1,
    "pages": [
      {
        "2": "https://www.google.com/search?q=Coffee&gl=us&hl=en&tbm=nws&ei=sim9ZPe7Noit5NoP3_efgAU&start=10&sa=N&ved=2ahUKEwj33Jes9qSAAxWIFlkFHd_7B1AQ8tMDegQIAhAE"
      },
      // ... More pages ...
    ]
  },
  "searchInformation": {
    "totalResults": "37600000",
    "timeTaken": 0.47
  },
  "newsResults": [
    {
      "position": 1,
      "title": "De'Longhi's TrueBrew Coffee Maker Boasts Simplicity, but the Joe Is Just So-So",
      "link": "https://www.wired.com/review/delonghi-truebrew-drip-coffee-maker/",
      "source": "WIRED",
      "snippet": "The expensive coffee maker with Brad Pitt as its spokesmodel is better than a capsule-based machine but not as good as competing single-cup...",
      "date": "1 day ago"
    },
    // ... More news results ...
  ]
}
You can display the obtained data on the screen or continue working with it.
Parse the Data
To further process the data, we need to parse it. For this purpose, we parse the response body as JSON:
data = response.json()
Now we can use the attribute names to retrieve specific data:
news = data['newsResults']
Thus, we have put all the news into the news variable.
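As a quick sketch of what working with this list looks like, the loop below iterates over records shaped like the sample response above (the data here is hard-coded for illustration, since it normally comes from the API):

```python
# A sample list shaped like "newsResults" from the JSON response above.
news = [
    {
        "position": 1,
        "title": "De'Longhi's TrueBrew Coffee Maker Boasts Simplicity, but the Joe Is Just So-So",
        "link": "https://www.wired.com/review/delonghi-truebrew-drip-coffee-maker/",
        "source": "WIRED",
        "snippet": "The expensive coffee maker with Brad Pitt as its spokesmodel...",
        "date": "1 day ago",
    }
]

for item in news:
    # .get() avoids a KeyError if a field is missing in some records
    print(item.get("title"), "-", item.get("source"))
```

Using `.get()` instead of direct indexing is a small defensive touch: real responses occasionally omit a field, and it is better to print `None` than to crash mid-loop.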
Save the Gathered Data
To save the obtained data as an Excel file, we use Pandas. With this library, we can turn the JSON response into a dataframe, an organized data set laid out as a table.
df = pd.DataFrame(news)
The headings will be identical to the attribute names. Now let’s just save the dataframe to a file:
df.to_excel("news_result.xlsx", index=False)
The result is a table like this:
To make the code more reliable, let’s add a try/except block and a check for a successful response. The resulting code:
import requests
import pandas as pd
keyword = 'new york good news'
api_url = 'https://api.hasdata.com/scrape/google'
headers = {'x-api-key': 'YOUR-API-KEY'}
params = {
'q': keyword,
'domain': 'google.com',
'tbm': 'nws'
}
try:
    response = requests.get(api_url, params=params, headers=headers)
    if response.status_code == 200:
        data = response.json()
        news = data['newsResults']
        df = pd.DataFrame(news)
        df.to_excel("news_result.xlsx", index=False)
    else:
        print('Request failed with status code:', response.status_code)
except Exception as e:
    print('Error:', e)
Thus, we got the data without the need to process HTML pages, use proxies or search for ways to bypass blocking and captchas.
Scrape Google News Results using Selenium
The next option for scraping Google News is to use Python libraries. In this case, it is worth using headless browsers to mimic the behavior of a real user to reduce the risk of blocking.
We will use Selenium to build the Google News scraper because it works with different programming languages and supports several web drivers. In this tutorial, we will use the Chrome web driver.
Install the Library and Download Webdriver
To install Selenium, type at the command prompt:
pip install selenium
Then go to the Chrome webdriver website and download the version you need (it should match the version of Google Chrome you have installed).
Research Google News Page Structure
Before writing the code, look at the Google News page and research the parts we will scrape. The first thing to look at is the link to the Google News page. Let’s go over and see what it looks like.
As we can see, we can easily compose scraping links by replacing “new york good news” with any other query.
Now let’s go to the developer tools (F12 or right-click on the screen and Inspect) and look at one of the results in more detail.
All news results sit inside a div tag with id="rso". We can use this and the HTML page structure to get the needed data. To get the elements themselves, we can use the selector "div#rso > div > div > div > div > div > div", which targets the data in nested div tags.
In another situation, we would get the data from the elements using classes, such as the "SoaBEf" class, which is common to all elements. However, class names in Google News change often and are not stable. Therefore, let’s rely instead on the page structure and elements that are less likely to change.
Here, as we can see, we can get the following data:
- The link to the news.
- The name of the resource where the news is posted.
- The headline of the news.
- A description of the news.
- How long ago the news was published.
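The fields above map naturally onto a small record type. A minimal sketch of how each scraped item could be represented (the class and field names here are our own choices, not anything dictated by the page):

```python
from dataclasses import dataclass

@dataclass
class NewsItem:
    """One scraped Google News result; fields mirror the list above."""
    link: str
    domain: str
    title: str
    description: str
    published: str  # relative date as shown on the page, e.g. "1 day ago"

# Hypothetical example values for illustration
item = NewsItem(
    link="https://example.com/article",
    domain="example.com",
    title="Sample headline",
    description="Sample description",
    published="1 day ago",
)
print(item.title)
```

Collecting results into typed records like this (rather than positional lists) makes the later save-to-file step less error-prone, since each field is accessed by name.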
Now that we know what data we need, let’s move on to scraping.
Import Library and Set Parameters
Create a new file with a *.py extension and import the necessary Selenium library modules:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
Now let’s set the path to the previously downloaded web driver file and the link to the Google News page to be scraped.
chromedriver_path = 'C://chromedriver.exe'
url = 'https://www.google.com/search?q=new+york+good+news&tbm=nws'
We also need to specify the parameters of the webdriver to run.
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)
This concludes the preparation, and you can move on to data collection.
Go to the Google News and Scrape Data
All we have to do is run the query and collect the data. To do this, run webdriver:
driver.get(url)
If you run the script now, a Google Chrome window will launch and navigate to the search query.
Now let’s parse the content of the page we are on. For this purpose, we have previously investigated the web page and studied its structure. Now let’s use it and get all the news on the page:
news_results = driver.find_elements(By.CSS_SELECTOR, 'div#rso > div > div > div > div')
Then we loop over each element one by one:
for news_div in news_results:
First, let’s collect the links and display them on the screen:
    news_link = news_div.find_element(By.TAG_NAME, 'a').get_attribute('href')
    print("Link:", news_link)
Then we get the remaining elements:
    divs_inside_news = news_div.find_elements(By.CSS_SELECTOR, 'a > div > div > div')
    news_item = []
    for new in divs_inside_news:
        news_item.append(new.text)
Now let’s display these values on the screen:
    print("Domain:", news_item[1])
    print("Title:", news_item[2])
    print("Description:", news_item[3])
    print("Date:", news_item[4])
Make a separator between the different news items so it’s visually apparent:
    print("-"*50+"\n\n"+"-"*50)
And finally, close the web driver.
driver.quit()
Now, if we run this script, we will get the data in the form:
Full code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

chromedriver_path = 'C://chromedriver.exe'
service = Service(chromedriver_path)
driver = webdriver.Chrome(service=service)

url = 'https://www.google.com/search?q=new+york+good+news&tbm=nws'
driver.get(url)

news_results = driver.find_elements(By.CSS_SELECTOR, 'div#rso > div > div > div > div')

for news_div in news_results:
    news_item = []
    try:
        news_link = news_div.find_element(By.TAG_NAME, 'a').get_attribute('href')
        print("Link:", news_link)
        divs_inside_news = news_div.find_elements(By.CSS_SELECTOR, 'a > div > div > div')
        for new in divs_inside_news:
            news_item.append(new.text)
        print("Domain:", news_item[1])
        print("Title:", news_item[2])
        print("Description:", news_item[3])
        print("Date:", news_item[4])
        print("-"*50+"\n\n"+"-"*50)
    except Exception as e:
        print("No elements found:", e)

driver.quit()
If you want to save the data in Excel format, you can enter it line by line or, as in the previous option, create a dataframe and save it all at once.
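If you prefer the line-by-line route, a minimal sketch using the standard library's `csv` module looks like the following. The rows here are hard-coded stand-ins for what the Selenium loop would collect, and the field names are our own choices:

```python
import csv

# Rows as they might be collected inside the Selenium loop; hard-coded
# here for illustration.
rows = [
    {"link": "https://example.com/a", "domain": "example.com",
     "title": "Headline A", "description": "Desc A", "date": "1 day ago"},
    {"link": "https://example.com/b", "domain": "example.com",
     "title": "Headline B", "description": "Desc B", "date": "2 days ago"},
]

with open("news_result.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["link", "domain", "title", "description", "date"]
    )
    writer.writeheader()   # column headers, like the DataFrame version
    writer.writerows(rows)
```

A CSV file opens in Excel just as easily as a .xlsx file, and this approach avoids holding all results in memory if you write each row as it is scraped.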
Conclusion and Takeaways
This article discussed two ways to scrape Google News data using Python: using the Google News API and applying web scraping methods using the Selenium library. The Google News API offers a simple approach, providing data in JSON format that can be quickly processed and analyzed. Obtaining the API key and setting the parameters lets you quickly retrieve news information according to your requests.
For those who need more control and flexibility, web scraping with Selenium is an alternative. By mimicking a user’s behavior and interaction with a web page, Selenium allows you to extract specific data elements. This method is appropriate when more complex interactions with a web page are required, such as filling in fields.
The article described a step-by-step process for both methods and provided code samples showing how to use each method to retrieve Google News data. By following the instructions and code snippets, you can get a clear idea of how to collect news data from Google News for machine learning or your analytical and research needs.