How to Scrape and Download Google Images: A Step-by-Step Guide
Web scraping is one of the most popular ways to collect data. It is an excellent substitute for manual data collection: it reduces the chance of human error, saves time, and delivers large volumes of data quickly.
We’ve already talked a lot about how to collect data from online stores to track competitors, as well as how to scrape Google services like Maps and SERP (search engine results page). Today, we will discuss how you can quickly and efficiently collect images using web scraping.
Benefits of Image Scraping
Although Google Images scraping is not in as much demand as scraping other Google services, it can still be a necessity in some areas of business. For example, you can use scraped images to improve the quality of visual content and user engagement. It is also helpful for finding trending and relevant images. And, of course, it is a great way to gather large collections of images, whether you want to build image galleries for websites or assemble datasets for machine learning.
Tools for Web Scraping Google Images
Image scraping methods fall into two categories: manual and automated. Since manual scraping means downloading images by hand, which takes considerable time and effort, we won't dwell on it.
Instead, let's look at ways to automate scraping Google Images.
Creating a Custom Web Scraper
You can try creating your own image scraper if you have basic programming skills. Almost any programming language will work for this purpose. If you want to learn which libraries to use and gain foundational skills, you can read our articles on web scraping with Python, NodeJS, PHP, Ruby, R, or C#. For example, in Python you can easily collect data using Beautiful Soup, urllib, or Selenium with a web driver. We've written before about how to scrape Google search results using these libraries and CSS selectors, and scraping images is not very different.
Although this might seem difficult initially, and you’ll encounter some challenges, there is a way to overcome them. For instance, you can use a web scraping API to simplify the scraping process and reduce the risk of being blocked.
Using Online Services
The second way to automate the Google Images scraping process is to use a dedicated web scraping service. Here, you won't need programming skills, but you will only be able to collect the data the service exposes. If the fields you need aren't among them, the service won't suit you.
Besides dedicated services, there are also various browser plug-ins. However, they tend to be either inflexible or to require customization that, again, calls for programming skills.
Given that, we would like to minimize the need for programming skills while still getting a versatile solution with a wide range of functionality. Therefore, we will use the Web Scraping API's Request Builder. This way, we can configure all the parameters, including localization, and execute requests right on the site.
Get Image Data using HasData Request Builder
We'll look at scraping images in Python using the HasData API later. First, however, we want to show how to get links to images without programming. For convenience, we have broken this process into steps.
Step 1: Obtain an API Key
Sign up on the site to get access to the Google SERP API and a personal account where you can run queries. You will also get free credits, so you can try out the functionality without paying.
Your API key is on the dashboard tab. We will need it in the following example.
Step 2: Go to the Google SERP API tab
To collect images, click on the Google SERP tab in your account.
This tool allows you to collect data from search results, images, news, local results, and shopping.
Step 3: Add Query Parameters for the Search
Now, look at the primary fields you can set up to customize your query. The first thing to set is the geographic location.
You need to select the city or country you want to get results from. You can also specify a region code instead of a location name. Depending on your settings, the results will be different.
Then customize the localization. You don’t need to change anything if you are satisfied with the default settings.
The main parameters here are domain, country, and language, so you can limit yourself to these. Then comes a block of advanced parameters. You can familiarize yourself with them and use them if necessary, but let's move on to the next required parameter.
To scrape Google Images, you should specify the appropriate search type (tbm) as Google Images (isch).
And the last parameter to specify is the keyword you want to get images for.
The other parameters are also quite helpful, and their purpose is intuitive, so explore them as needed. For example, you can use them to specify the number of images you want to get.
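To make the setup concrete, here is roughly what the assembled request might look like as a URL. The q, domain, and tbm parameters match the ones used in the code later in this article; the location, gl, and hl names are illustrative, so check the HasData documentation for the authoritative list:

https://api.hasdata.com/scrape/google?q=coffee&location=Austin,Texas,United+States&domain=google.com&gl=us&hl=en&tbm=isch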
Step 4: Send the API Request
Now that all the parameters have been specified, run the query.
Running one such request costs five credits, and you get 1,000 free credits at sign-up, enough for 200 requests. The API lets you fine-tune your request with flexible settings and also handles proxies, CAPTCHA avoidance, and JS rendering to reduce the risk of blocking.
Step 5: Save the Scraped Image URLs
The result will be a JSON response with all the available content, including each image's title and link. You can copy the whole JSON response or just the part you need.
To save this data as an Excel file, you can use the built-in data import or convert the JSON to XLSX with a dedicated converter. As a result, you will get the data in the following form:
As you can see from the table, you get image position, title, source, thumbnail, link, and even dimensions.
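For reference, an entry in the response looks roughly like the fragment below. The values are placeholders, but the field names match the table above and the imagesResults key used in the code later in this article:

{
  "imagesResults": [
    {
      "position": 1,
      "title": "Coffee - Wikipedia",
      "source": "wikipedia.org",
      "thumbnail": "https://...",
      "original": "https://...",
      "link": "https://..."
    }
  ]
}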
Easy Google Images Scraper in Python with API
Let’s look at how to scrape images using Python. We will use Python 3, and if you don’t know how to prepare the environment and configure everything you need, you can read our article: “Web Scraping with Python: from Fundamentals to Practice”.
Step 1: Install Necessary Libraries
First, we need to install the libraries. To scrape data from Google image searches, we only need the Requests library. You can install it from the command prompt:
pip install requests
Now let’s move on to creating the script.
Step 2: Import Required Modules
Create a new file with the *.py extension in which we will write the script, and import the Requests library in it:
import requests
We need the Requests library to execute requests to the HasData API and receive its responses.
Step 3: Set Up API Key and Parameters
Let's set up the parameters we're going to use. When a value is used in several places, it is better to store it in a variable. So, let's start by creating variables for the API endpoint and the keyword.
keyword = 'Coffee'
api_url = 'https://api.hasdata.com/scrape/google'
Now let's create a request header and put in the API key we got earlier.
headers = {'x-api-key': 'YOUR-API-KEY'}
The last step is to create the request's body and enter the necessary parameters. In the 'q' parameter, we put the keyword stored earlier in the variable. Next, specify the Google domain where the search will be performed. Finally, the 'tbm' parameter should have the value 'isch' to search for images.
params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'isch'
}
As mentioned in the previous example, many more parameters are available. You can find all of them in our documentation.
Step 4: Create and Send API Request
Now we just need to gather all of the above into one request and execute it. Wrap the call in a try...except block so that an error in the request doesn't crash the script.
try:
    response = requests.get(api_url, params=params, headers=headers)
    # The rest of the code will go here
except Exception as e:
    print("Failed to make the API request:", e)
This way, the script reports the error and keeps running instead of aborting with an unhandled exception.
Step 5: Parse and Process API Response
Getting and processing the response is easy, too. First, make sure the request returned a successful response, then load the JSON data into a variable.
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
You can define what happens when the request is not successful, for example, print the response code. Or you can add nothing, in which case the code runs only on a successful response.
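For instance, the full script at the end of this section reports the failure like this:

else:
    print("Failed to get the API response. Status code:", response.status_code)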
Step 6: Extract Image URLs and Data
Now make sure the script works correctly. Get a list of image titles and links, then print them.
images_results = data['imagesResults']
for image in images_results:
    print(image['title'], ":", image['original'])
Since the response is returned in JSON format, it is relatively easy to work with. Run the script and make sure everything is correct.
We now have all the data we need. Let's proceed to the next step and save the images themselves.
Step 7: Download Images
There are many ways to save an image, but we won't dwell on them; let's use the already imported Requests library. First, here is the algorithm we will follow:
1. Create a folder to store the images. We could skip this, but then all images would end up in one shared folder, which is inconvenient. So, we will create a folder named after the keyword.
2. Get the image's file extension. Since we need a file format to save under, we will derive the extension from the 'original' URL.
3. Build the file name from the title and the extension.
4. Save the files one by one.
We need two additional modules for these tasks: re (for regular expressions) and os (for filesystem operations such as creating folders). Both are part of the standard library, so we can import them right away.
import os
import re
Following the algorithm, let’s create a folder. To avoid errors, remove possible spaces and other characters:
folder_name = re.sub(r'[^\w\-]+', '_', keyword)
os.makedirs(folder_name, exist_ok=True)
For every image, sanitize the title and extract the file extension from the URL:
images_results = data['imagesResults']
for image in images_results:
    image_title = re.sub(r'[^\w\-]+', '_', image['title'])
    image_url = image['original']
    image_extension = image_url.split('.')[-1]
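One caveat: a plain split('.') misfires when the URL carries a query string (for example, photo.jpg?width=800 would yield jpg?width=800). If you run into that, a small standard-library sketch like this, substituted for the split line inside the loop (with the imports at the top of the file), is more robust:

from urllib.parse import urlparse
from os.path import splitext

# Take the extension from the URL path only, ignoring any query string;
# fall back to 'jpg' if the path has no extension at all
image_extension = splitext(urlparse(image_url).path)[1].lstrip('.') or 'jpg'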
Set the file name and path:
    image_file_name = f"{image_title}.{image_extension}"
    image_path = os.path.join(folder_name, image_file_name)
Now let's download each image and save it to a file, printing a message about whether it succeeded. We request the image first and only open the file on success, so a failed download doesn't leave an empty file behind:
    image_response = requests.get(image_url)
    if image_response.status_code == 200:
        with open(image_path, "wb") as file:
            file.write(image_response.content)
        print(f"Image '{image_title}' downloaded successfully.")
    else:
        print(f"Failed to download the image '{image_title}'. Status code:", image_response.status_code)
Run the script and check that everything works correctly:
As a result of the execution, we got a Coffee folder with all the resulting images:
Full code:
import requests
import os
import re

keyword = 'Coffee'
api_url = 'https://api.hasdata.com/scrape/google'
headers = {'x-api-key': 'YOUR-API-KEY'}
params = {
    'q': keyword,
    'domain': 'google.com',
    'tbm': 'isch'
}

try:
    response = requests.get(api_url, params=params, headers=headers)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()

        # Create a folder named after the keyword
        folder_name = re.sub(r'[^\w\-]+', '_', keyword)
        os.makedirs(folder_name, exist_ok=True)

        # Save the images to the folder
        images_results = data['imagesResults']
        for image in images_results:
            print(image['title'], ":", image['original'])
            try:
                image_title = re.sub(r'[^\w\-]+', '_', image['title'])
                image_url = image['original']
                image_extension = image_url.split('.')[-1]
                image_file_name = f"{image_title}.{image_extension}"
                image_path = os.path.join(folder_name, image_file_name)

                # Request the image first and only open the file on success,
                # so failed downloads don't leave empty files behind
                image_response = requests.get(image_url)
                if image_response.status_code == 200:
                    with open(image_path, "wb") as file:
                        file.write(image_response.content)
                    print(f"Image '{image_title}' downloaded successfully.")
                else:
                    print(f"Failed to download the image '{image_title}'. Status code:", image_response.status_code)
            except Exception as e:
                print("Failed to download the image:", e)
    else:
        print("Failed to get the API response. Status code:", response.status_code)
except Exception as e:
    print("Failed to make the API request:", e)
Using the API makes it quick and easy to retrieve large sets of images, while it takes care of CAPTCHA avoidance, proxy rotation, and block prevention for you.
How to Scrape Google Images in Python with BeautifulSoup
Now, let's see how to scrape images without the API. We could use any scraping library for this, so let's take the most straightforward option: the already familiar Requests library plus the BeautifulSoup library for parsing the page.
This option will not let us switch pages or simulate user behavior, but BeautifulSoup is excellent for beginners and easy to learn. If you want to emulate a browser, look at the Selenium library. It uses a web driver, for example chromedriver, and allows you to control a headless browser.
Step 1: Research Google Images Page
Before we move on to creating the script, we need to examine the page. We didn’t need to do this in the previous examples, as we used the API and got a ready and well-structured result.
Go to Google Images and search for the query "Coffee". Pay attention to the link structure; it will be helpful if you want data for a whole list of keywords rather than a single one.
We can use variables to create the necessary search query:
https://www.google.com/search?q={KEYWORD}&tbm=isch
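For example, if you keep your keywords in a list, a small sketch like this builds the search URL for each of them (quote_plus from the standard library handles spaces and special characters):

from urllib.parse import quote_plus

keywords = ["coffee", "coffee beans", "espresso machine"]
urls = [f"https://www.google.com/search?q={quote_plus(kw)}&tbm=isch" for kw in keywords]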
Open DevTools (F12, or right-click the page and choose Inspect) and find the code that corresponds to an image. Use the element picker to locate an item on the page.
Here we can examine the code of each element and find the data we need. The src stored directly in the img tag gives us image previews; to get the original image, we would have to extract the JavaScript code, execute it, and parse the result. Since this is a basic example, we will stick with the previews.
Step 2: Install Necessary Libraries
As we mentioned earlier, we will need the BeautifulSoup library. You can install it together with Requests using this command:
pip install requests beautifulsoup4
Let’s move on to creating the script.
Step 3: Navigate to Google Images
Import all the necessary libraries and set the variables. Here we set a user-agent header to reduce the risk of blocking. You can look up your own browser's user agent or use ours.
import requests, re, os
from bs4 import BeautifulSoup
keyword = "coffee"
url = f"https://www.google.com/search?q={keyword}&tbm=isch"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
Then let's create a try...except block and execute the request inside it:
try:
    response = requests.get(url, headers=headers)
    # The rest of the code will go here
except Exception as e:
    print("Failed to make the request:", e)
Now let's make sure the request succeeded and create the folder we need. This block is the same as in the previous example, so we won't repeat it here.
Step 4: Parse HTML Response with BeautifulSoup
To pull specific elements out of the HTML page, we parse the response with BeautifulSoup and collect the src attribute of every img tag:
soup = BeautifulSoup(response.content, 'html.parser')
images = soup.find_all('img')
img_src_list = [img['src'] for img in images if 'src' in img.attrs]
We have a list of links to image previews and can move on to the next step.
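One caveat: Google often inlines some previews as base64 data: URIs rather than HTTP links, and requests can't download those, so the corresponding iterations would fail. A minimal filter keeps only the fetchable URLs:

# Keep only HTTP(S) links; inline base64 previews can't be fetched with requests
img_src_list = [src for src in img_src_list if src.startswith('http')]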
Step 5: Download Images
Let's set up a counter to number the saved images, then go through the list of links in order and save each one. The saving process is identical to the previous example.
count = 0
for img in img_src_list:
    count += 1
    image_path = os.path.join(folder_name, f"{count}_{folder_name}.jpg")
As a result of running the script, we got a folder with previews of images.
Here’s the complete code, in case you’re having trouble:
import requests, re, os
from bs4 import BeautifulSoup

keyword = "coffee"
url = f"https://www.google.com/search?q={keyword}&tbm=isch"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

try:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        count = 0
        folder_name = re.sub(r'[^\w\-]+', '_', keyword)
        os.makedirs(folder_name, exist_ok=True)

        soup = BeautifulSoup(response.content, 'html.parser')
        images = soup.find_all('img')
        img_src_list = [img['src'] for img in images if 'src' in img.attrs]

        for img in img_src_list:
            count += 1
            image_path = os.path.join(folder_name, f"{count}_{folder_name}.jpg")
            try:
                # Request the image first and only open the file on success
                image_response = requests.get(img)
                if image_response.status_code == 200:
                    with open(image_path, "wb") as file:
                        file.write(image_response.content)
                    print(f"Image '{img}' downloaded successfully.")
                else:
                    print(f"Failed to download the image '{img}'. Status code:", image_response.status_code)
            except Exception as e:
                print("Failed to download the image:", e)
    else:
        print("Failed to fetch the webpage. Status code:", response.status_code)
except Exception as e:
    print("Failed to make the request:", e)
After running the script, we got only 20 images: the static HTML that Google returns without executing JavaScript contains only the first batch of previews. To get more, use the Selenium library. It can scroll the page so that more images load, removing the limit on how many you can download.
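Here is a minimal sketch of that approach, assuming Selenium 4 and a local Chrome installation; the scroll count and pauses are illustrative and may need tuning:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Launch a headless Chrome; Selenium 4 manages the driver automatically
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(options=options)

driver.get('https://www.google.com/search?q=coffee&tbm=isch')

# Scroll down a few times so Google lazy-loads more thumbnails
for _ in range(5):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)

# Collect the src attributes of all loaded img elements
src_list = [img.get_attribute('src') for img in driver.find_elements(By.TAG_NAME, 'img')]
print(f"Collected {len(src_list)} image elements")
driver.quit()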
Conclusion and Takeaways
For scraping Google Images, you can use either an API or scraping libraries. Your choice should be based on your skills, goals, and tasks. Using an API is much easier: you won't have to worry about blocks, CAPTCHAs, or proxies. On the other hand, if you're good at programming, you can create a tool tailored exactly to your tasks.
Using this step-by-step tutorial, you can try out the basics of scraping Google Images, choose a tool, and take your first step in gathering images. We’ve used simple examples so you can understand and learn the fundamentals.