How to Scrape YouTube Data for Free: A Complete Guide

Valentina Skakun
Last update: 29 Oct 2024

YouTube is the world’s leading video platform, hosting millions of creators and viewers every day. But the platform offers more than viewing: scraping YouTube data provides a range of powerful benefits for content creators and marketers. It allows you to track emerging trends, assess the performance of your own videos, and generate new ideas for future content.

By analyzing data from other channels, creators can identify opportunities to innovate or enhance their material. Scraping comments, in particular, can provide deeper insight into audience reactions, helping creators make changes to increase engagement and improve content based on real feedback.

Preparing for YouTube Scraping

In this article, we will provide you with several Python scripts for scraping data from YouTube. If you are new to Python, you can either read our earlier article on Python scraping for beginners or simply use the ready-made scripts in Google Colaboratory, where we will provide all the examples. This approach allows you to learn how to create scripts and also to run them directly in the cloud.

Methods for Scraping YouTube Data

Before we dive into examples, let’s explore the different methods for collecting data from YouTube, focusing on a few key techniques that we will cover in this article:

  1. Manual data collection. As mentioned, you can visit the YouTube page of interest and gather data manually. However, this method is not suitable for continuous monitoring due to its slow speed and the constant changes on the platform.

  2. Using the YouTube API. To simplify interactions with the video hosting service and avoid scraping, the YouTube API was developed for developers. We will discuss this API in detail in this article.

  3. Using plugins or scraping services. If you prefer to obtain ready-to-use data immediately, you can explore existing solutions, such as plugins or online services designed for scraping YouTube data.

  4. Scraping YouTube directly. This can be done using specialized libraries or any other tool that enables scraping, such as Selenium, Pyppeteer, or Playwright.

We will examine a few of these methods, focusing on the YouTube API as the most convenient and officially supported approach. We will also discuss how to scrape YouTube data directly if using the official YouTube API doesn’t meet your needs. 

Types of Data You Can Extract from YouTube

Because of YouTube’s popularity as a platform, it’s important to consider what specific data we can extract from its pages. We can broadly categorize this data into several groups:

  1. Video page.

  2. Channel page.

  3. Search results page.

  4. Playlists.

Let’s examine each of these categories, starting with the video page, to see what information we can extract.

YouTube video example

From the video page, you can obtain the download link for the video, its title, description, view count, like count, and comment count, as well as the publication date and the comments themselves. 

Next, let’s move on to the channel page:

YouTube channel example

In addition to recommendations, popular videos, and other features, the channel page provides information such as the channel name, owner, subscriber count, and total number of videos.

Next we can navigate to the search results page to find relevant videos:

YouTube search results example

From this page, we can get a list of videos that match the specified filters, along with a brief description of each video. Depending on the selected filters, this may include relevant videos, channels, playlists, and movies.

Lastly, let’s look at playlists: 

YouTube playlist example

On the playlist page, we can obtain essential information about the playlist, including its title, author, the number of videos included, view count, and description, as well as a list of videos with brief descriptions.

Some of the methods we will discuss in this article will not give you access to all of the data points we’ve just looked at. We will cover the pros and cons of each method for YouTube scraping and provide our recommendations for each scraping solution.

Using YouTube Data API

Before we look at the main requests for the YouTube API, it’s important to note that you’ll need an API key to access the service. Instructions on how to obtain this key can be found in the official YouTube documentation. It is important to keep in mind that Google imposes limits on the number of requests you can make, which are outlined in the YouTube API Quotas section.

Each request for channel, video, playlist, or comment data consumes 1 quota point. However, queries using the search method are more resource-intensive, costing 100 quota points per request. Google assigns each API key a daily quota of 10,000 points, which is typically sufficient for most use cases, making the API effectively free for moderate usage.

When working with the API, it’s important to understand how many results each request will return:

  1. For playlists, you can retrieve between 0 and 50 items, with a default of 5.

  2. For search queries, the result count ranges from 0 to 50, with a default of 5.

  3. For comment threads, you can retrieve between 1 and 100 results, with a default of 20.

  4. For channel data, you can fetch between 0 and 50 results, with the default set at 5.

  5. For video data, requests generally return information for the specified video only.

This quota system allows developers to manage their API usage, balancing performance and cost-effectiveness.
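
To see how these numbers play out in practice, here is a minimal budgeting sketch in Python (the request counts are hypothetical and only for illustration):

SEARCH_COST = 100    # quota points per search request
DETAIL_COST = 1      # quota points per channel, video, playlist, or comment request
DAILY_QUOTA = 10000  # default daily quota per API key

searches = 20          # hypothetical: 20 search queries per day
detail_requests = 500  # hypothetical: 500 follow-up detail requests

used = searches * SEARCH_COST + detail_requests * DETAIL_COST
print(f"Quota used: {used} / {DAILY_QUOTA}")  # Quota used: 2500 / 10000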

In this section, we will explore the primary methods for interacting with the YouTube API and write small Python scripts to automate the requests and save the necessary data. Generally speaking, the API can be accessed using the following URL (only the methods will change):

https://www.googleapis.com/youtube/v3

To help you navigate this, we will provide links to Google Colaboratory, where you can find and try out ready-made scripts:

  1. YouTube Channel Information. This script allows you to retrieve and save data for a specific channel of interest.

  2. YouTube Video Data. This script provides essential information about a video you are interested in.

  3. YouTube Playlist Data. This script enables you to quickly obtain information about a playlist you want to explore.

  4. YouTube Comments Data. This is suitable for tracking comments, including replies to existing ones.

  5. YouTube Search Data. With this script, you can retrieve up-to-date information with a list of relevant videos based on your query.

You can find a complete list of all of the API methods on the official YouTube API page.

YouTube Channel Scraping

This API is ideal for quickly gathering data about a specific YouTube channel. The Channels method parameters page provides a complete list of parameters that this endpoint can return.

You can specify the channel that you want to gather data on in two ways: by its username or by its channel ID.

  1. Using the username:
https://www.googleapis.com/youtube/v3/channels?part=param1,param2,param3&forUsername=channelName&key=yourAPIkey

  2. Using the channel ID:
https://www.googleapis.com/youtube/v3/channels?part=param1,param2,param3&id=channelID&key=yourAPIkey

To use these endpoints, replace param1, param2, and param3 with the specific parameters you wish to retrieve, provide either the channel’s username or ID, and include your API key.

Next, let’s write a Python script that will make a request to the API, display the main parameters, and save the retrieved data to a file. Start by creating a new file with the *.py extension and import the necessary libraries:

import requests
import json

Then, define the variables and parameters that will vary in your script:

API_KEY = 'YouTube-API-key'
CHANNEL_ID = 'UCK8sQmJBp8GCxrOtXWBpyEA'
USERNAME = 'Google'
PARTS = 'snippet,contentDetails,statistics'


choice = 'id'  # Choose: 'id' or 'username'

For convenience, we’ve listed the parameters we want to retrieve and added a choice variable that determines how the channel is identified: by its ID or by its username.

Now, we’ll construct the URL using the parameters defined in the variables and make the API request:

if choice == 'id':
    url = f'https://www.googleapis.com/youtube/v3/channels?part={PARTS}&id={CHANNEL_ID}&key={API_KEY}'
else:
    url = f'https://www.googleapis.com/youtube/v3/channels?part={PARTS}&forUsername={USERNAME}&key={API_KEY}'


response = requests.get(url).json()

At this point, you could save the received JSON response to a file:

with open('channel_data.json', 'w', encoding='utf-8') as f:
    json.dump(response, f, ensure_ascii=False, indent=4)

However, we will further process the received response to extract the essential data:

item = response['items'][0] if 'items' in response and response['items'] else {}
snippet = item.get('snippet', {})
statistics = item.get('statistics', {})


channel_info = {
    "Channel Title": snippet.get('title', 'N/A'),
    "Description": snippet.get('description', 'N/A'),
    "Published Date": snippet.get('publishedAt', 'N/A'),
    "Subscriber Count": statistics.get('subscriberCount', 'N/A'),
    "Total Views": statistics.get('viewCount', 'N/A'),
    "Video Count": statistics.get('videoCount', 'N/A'),
    "Custom URL": snippet.get('customUrl', 'N/A'),
    "Thumbnails": snippet.get('thumbnails', {}).get('high', {}).get('url', 'N/A')
}

Here, we’ve accounted for the possibility that some data might be missing. To prevent errors, we’ve assigned a default value of “N/A” when certain information is not found.

Next, let’s print the results to the screen:

print(json.dumps(channel_info, indent=4, ensure_ascii=False))

You can also save the final data:

with open('channel_info.json', 'w', encoding='utf-8') as f:
    json.dump(channel_info, f, ensure_ascii=False, indent=4)

For example, using the channel we selected, you would receive the following response:

YouTube channel API example

In the future, you may want to enhance the script to collect data from a list of channels stored in a file rather than a single channel. To do this, add functionality for reading the channel IDs from a file and place the request logic inside a loop, as sketched below.
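
A minimal sketch of that enhancement might look like this. It assumes a channels.txt file with one channel ID per line (the filename is our own choice) and reuses the imports and the API_KEY and PARTS variables defined earlier:

with open('channels.txt', encoding='utf-8') as f:
    channel_ids = [line.strip() for line in f if line.strip()]

all_channels = []
for channel_id in channel_ids:
    url = f'https://www.googleapis.com/youtube/v3/channels?part={PARTS}&id={channel_id}&key={API_KEY}'
    response = requests.get(url).json()
    if response.get('items'):  # skip IDs that return no data
        all_channels.append(response['items'][0])

with open('channels_list_info.json', 'w', encoding='utf-8') as f:
    json.dump(all_channels, f, ensure_ascii=False, indent=4)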

YouTube Video Scraping

This API is suitable for extracting information about specific YouTube videos. The Videos method parameters page provides a complete list of parameters that this endpoint can return.

To retrieve data about a video, you need to use its ID. A base request looks like this:

https://www.googleapis.com/youtube/v3/videos?part=param1,param2,param3&id=bEkZ6H9QY2s&key=yourAPIkey

In this request, the part parameters specify which information you want to retrieve. Make sure to replace yourAPIkey with your own API key; the id value is the unique identifier of the video.

Now, let’s write a Python script to send a request to the API, extract the key parameters, and save the retrieved data to a file. First, create a new file with a .py extension and import the necessary libraries:

import requests
import json

Next, set the variables and required parameters:

API_KEY = 'YouTube-API-key'
VIDEO_ID = 'bEkZ6H9QY2s'
PARTS = 'snippet,contentDetails,statistics'

Now, let’s construct the URL using the parameters defined above and send the request to the API:

url = f'https://www.googleapis.com/youtube/v3/videos?part={PARTS}&id={VIDEO_ID}&key={API_KEY}'
response = requests.get(url).json()

At this point, you can save the received JSON response to a file:

with open('video_data.json', 'w', encoding='utf-8') as f:
    json.dump(response, f, ensure_ascii=False, indent=4)

Next, let’s process the response to extract the main data:

item = response['items'][0] if 'items' in response and response['items'] else {}
snippet = item.get('snippet', {})
statistics = item.get('statistics', {})


video_info = {
    "Video Title": snippet.get('title', 'N/A'),
    "Description": snippet.get('description', 'N/A'),
    "Published Date": snippet.get('publishedAt', 'N/A'),
    "Channel Title": snippet.get('channelTitle', 'N/A'),
    "View Count": statistics.get('viewCount', 'N/A'),
    "Like Count": statistics.get('likeCount', 'N/A'),
    "Comment Count": statistics.get('commentCount', 'N/A'),
    "Duration": item.get('contentDetails', {}).get('duration', 'N/A'),
    "Thumbnails": snippet.get('thumbnails', {}).get('high', {}).get('url', 'N/A')
}

Remember that some data may be unavailable. In such cases, we assign the value “N/A” to avoid errors.

To display the results on the screen, use:

print(json.dumps(video_info, indent=4, ensure_ascii=False))

You can also save the final data:

with open('video_info.json', 'w', encoding='utf-8') as f:
    json.dump(video_info, f, ensure_ascii=False, indent=4)

For example, for the video mentioned, you will receive a response containing the following data about the video:

YouTube video API example

As with the channel scraping script, you can expand this script to gather data for more than one video; just store all the video IDs in a file. Then add code to read the IDs from that file and wrap the entire script in a loop.

YouTube Playlists Scraping

The YouTube API also allows you to get data about playlists. The basic API request with this method looks as follows:

https://www.googleapis.com/youtube/v3/playlistItems?part=param1,param2,param3&playlistId=anyID&maxResults=50&key=yourAPIkey

If you don’t specify any parameters, the API response will look like this:

{
  "kind": "youtube#playlistItemListResponse",
  "etag": "8YBlUmh3hh1o6hfGp6VKPSip4c0",
  "nextPageToken": "EAAaelBUOkNESWlFRVUyTnpsRV",
  "items": [
    {
      "kind": "youtube#playlistItem",
      "etag": "Ji81mNleGlxKn3lh1Q4m_BBmaNE",
      "id": "VVVLOHNRbUpCcDhHQ3hyT3RYV0JweUVBLnk3Z0tsenZnOHhr"
    },
    {
      "kind": "youtube#playlistItem",
      "etag": "Bd32XLO__JWPbHYc07p6NpLvIAo",
      "id": "VVVLOHNRbUpCcDhHQ3hyT3RYV0JweUVBLkNWeFlUV21NME13"
    },
...
    {
      "kind": "youtube#playlistItem",
      "etag": "q8UZIIamS2NGOJBQx4gRvhM_9g0",
      "id": "VVVLOHNRbUpCcDhHQ3hyT3RYV0JweUVBLkstRUlBc29mRWRr"
    }
  ],
  "pageInfo": {
    "totalResults": 2098,
    "resultsPerPage": 50
  }
}

In this response, the items array contains individual playlist items, each of which includes data about the videos within the playlist.

Let’s write a Python script that will make a request to the API, display key playlist parameters, and save the retrieved data to a file. Start by creating a new file with the *.py extension and importing the necessary libraries:

import requests
import json

Next, define your variables and set the parameters:

API_KEY = 'YouTube-API-key'
PLAYLIST_ID = 'UUK8sQmJBp8GCxrOtXWBpyEA'
PARTS = 'snippet,contentDetails'
MAX_RESULTS = 50

Construct the URL using the parameters defined above and make the API request:

url = f'https://www.googleapis.com/youtube/v3/playlistItems?part={PARTS}&playlistId={PLAYLIST_ID}&maxResults={MAX_RESULTS}&key={API_KEY}'
response = requests.get(url).json()

At this point, you can save the JSON response to a file:

with open('playlist_data.json', 'w', encoding='utf-8') as f:
    json.dump(response, f, ensure_ascii=False, indent=4)

Next, let’s further process the response to extract essential data about the videos in the playlist:

items = response.get('items', [])
playlist_items = []


for item in items:
    snippet = item.get('snippet', {})
    content_details = item.get('contentDetails', {})


    video_info = {
        "Title": snippet.get('title', 'N/A'),
        "Description": snippet.get('description', 'N/A'),
        "Published At": snippet.get('publishedAt', 'N/A'),
        "Video ID": content_details.get('videoId', 'N/A'),
        "Channel Title": snippet.get('channelTitle', 'N/A'),
        "Thumbnails": snippet.get('thumbnails', {}).get('high', {}).get('url', 'N/A')
    }


    playlist_items.append(video_info)

In this script, we iterate through each playlist item in “items,” extracting the desired data and assigning “N/A” if any information is missing.

We can then print the results to the console:

print(json.dumps(playlist_items, indent=4, ensure_ascii=False))

In this example, we limited the number of videos to five:

YouTube playlist API example

You can save the final data to another file:

with open('playlist_info.json', 'w', encoding='utf-8') as f:
    json.dump(playlist_items, f, ensure_ascii=False, indent=4)

Doing this will allow you to receive structured data about each video in the selected playlist. In the future, you could enhance the script’s functionality by adding the ability to extract data from multiple playlists using a file containing a list of playlist IDs.

YouTube Comments Scraping

This method allows you to scrape video comments from any YouTube video, providing information about both the comments themselves and the users who posted them. Below is an example of an API request to obtain comments by video ID. The complete list of parameters can be found on the CommentThreads method documentation page.

Example Request:

https://www.googleapis.com/youtube/v3/commentThreads?part=param1,param2,param3&videoId=anyID&key=yourAPIkey

Replace “videoId” with the ID of the video you want to fetch comments from and specify your YouTube Data API access key along with the parameters you wish to retrieve.

Next, let’s write a Python script to execute the API request, display key comment parameters, and save the retrieved data to a file. First, we need to import the necessary libraries:

import requests
import json

Now, let’s define the variables and parameters:

API_KEY = 'YouTube-API-key'
VIDEO_ID = 'bEkZ6H9QY2s'
PARTS = 'snippet'

Next, we will construct the URL for the request and execute it:

url = f'https://www.googleapis.com/youtube/v3/commentThreads?part={PARTS}&videoId={VIDEO_ID}&key={API_KEY}'
response = requests.get(url).json()

Save the retrieved data to a file:

with open('comments_data.json', 'w', encoding='utf-8') as f:
    json.dump(response, f, ensure_ascii=False, indent=4)

Now, let’s extract only the essential data from the response. We will select the top-level comments and display the following information: the comment text, author, like count, and publication date.

comments_info = []


for item in response.get('items', []):
    comment = item['snippet']['topLevelComment']['snippet']
    comments_info.append({
        "Author": comment.get('authorDisplayName', 'N/A'),
        "Comment": comment.get('textDisplay', 'N/A'),
        "Likes": comment.get('likeCount', 0),
        "Published At": comment.get('publishedAt', 'N/A')
    })

As with previous examples, we’ve accounted for the possibility of missing information. If data is absent, we will display “N/A.”

print(json.dumps(comments_info, indent=4, ensure_ascii=False))

Finally, let’s save the final data to a new file:

with open('comments_info.json', 'w', encoding='utf-8') as f:
    json.dump(comments_info, f, ensure_ascii=False, indent=4)

As a result, you will receive the comments data for the selected video in JSON format. 

YouTube comments API example

As with the previous scripts, you can enhance this one to handle multiple videos by using a file containing a list of video IDs.

YouTube Search Results Scraping

The YouTube Search Results method allows you to retrieve a list of videos from YouTube based on a specified query. This method is particularly useful for analyzing popular videos, setting up regular trend tracking, or monitoring changes in video rankings on a specific topic. For example, you can request videos related to coffee, specifying the maximum number of results and the region.

A typical request will look like this:

https://www.googleapis.com/youtube/v3/search?part=param1,param2,param3&q=keyword&regionCode=US&type=video&maxResults=10&key=yourAPIkey

Now, let’s write a Python script that executes the API request, retrieves the main parameters for each video found, and saves them to a file. First, create a new file with a *.py extension and import the necessary libraries:

import requests
import json

Next, define the required parameters:

API_KEY = 'YouTube-API-key'
QUERY = 'Coffee'
REGION_CODE = 'US'
MAX_RESULTS = 10
PARTS = 'snippet'

Construct the API request URL and handle the response:

url = f'https://www.googleapis.com/youtube/v3/search?part={PARTS}&q={QUERY}&regionCode={REGION_CODE}&type=video&maxResults={MAX_RESULTS}&key={API_KEY}'
response = requests.get(url).json()

To save the entire response to a file, you can use the following code:

with open('search_results.json', 'w', encoding='utf-8') as f:
    json.dump(response, f, ensure_ascii=False, indent=4)

Let’s enhance the script to extract only the video-related data from the response, process it, and output the results:

items = response.get('items', [])


search_results = []


for item in items:
    snippet = item.get('snippet', {})
    video_id = item.get('id', {}).get('videoId', 'N/A')
    result_info = {
        "Video Title": snippet.get('title', 'N/A'),
        "Channel Title": snippet.get('channelTitle', 'N/A'),
        "Published Date": snippet.get('publishedAt', 'N/A'),
        "Video ID": video_id,
        "Description": snippet.get('description', 'N/A'),
        "Thumbnails": snippet.get('thumbnails', {}).get('high', {}).get('url', 'N/A')
    }
    search_results.append(result_info)


print(json.dumps(search_results, indent=4, ensure_ascii=False))

In our example, the output will resemble the following format:

YouTube search results API example

We can also save the final data to a file:

with open('parsed_search_results.json', 'w', encoding='utf-8') as f:
    json.dump(search_results, f, ensure_ascii=False, indent=4)

This script can be easily enhanced to handle multiple search results pages by adding support for nextPageToken to fetch additional sets of YouTube videos.
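
As a rough sketch of that enhancement, you could keep requesting pages until nextPageToken disappears or you reach a page cap (MAX_PAGES below is an arbitrary limit of our own; remember that each search page costs 100 quota points):

all_items = []
page_token = ''
MAX_PAGES = 3  # arbitrary cap to stay within quota

for _ in range(MAX_PAGES):
    paged_url = url + (f'&pageToken={page_token}' if page_token else '')
    page = requests.get(paged_url).json()
    all_items.extend(page.get('items', []))
    page_token = page.get('nextPageToken')
    if not page_token:  # no more pages
        break

print(f"Collected {len(all_items)} search results")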

Web Scraping with yt-dlp

You can use the yt-dlp library, a fork of the now-defunct youtube-dl, to download videos and audio from various websites, including YouTube. It is actively maintained and packed with features, making it a powerful and flexible tool for content downloading and data scraping.

This library is popular not only for its simplicity but also for its support of numerous sites, including Vimeo, Twitch, TikTok, Instagram, and many others. This makes it an excellent tool for scraping data from multiple platforms.

However, there are some limitations to using this tool for scraping YouTube.

Firstly, while yt-dlp can effectively scrape YouTube videos and their primary metadata (such as title, uploader, and duration), it is less efficient for in-depth analytics (like video comments, mentions, and trends). In these cases, the YouTube Data API is more suitable, as it provides access to much more detailed data.

Secondly, although yt-dlp can sometimes bypass CAPTCHA, this is not guaranteed. If YouTube detects suspicious activity, it may start displaying CAPTCHAs or other forms of verification, complicating the library’s ability to function without additional measures, such as using proxy servers.
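
If YouTube does start challenging your requests, yt-dlp accepts a proxy option in its settings dictionary. Here is a minimal sketch (the proxy address is a placeholder; installing the library is covered just below):

import yt_dlp

ydl_opts = {
    'extract_flat': True,
    'proxy': 'http://user:pass@proxy.example.com:8080',  # placeholder proxy address
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info('https://www.youtube.com/watch?v=bEkZ6H9QY2s', download=False)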

Lastly, unlike the YouTube Data API, which has clearly defined quotas, yt-dlp lacks mechanisms for managing request quotas or limits. This makes it less predictable for long-term usage and not always suitable for large-scale YouTube video data collection.

You should keep these limitations in mind before starting. Now let’s go over how to get started with yt-dlp. First, install it:

pip install yt-dlp

To scrape YouTube comments data, you’ll also need another library:

pip install youtube-comment-downloader

Using this library is generally simpler than accessing the API. As a final example, we’ll combine data collection from YouTube into a single script using Google Colaboratory.

Create a Python file with a *.py extension and import the necessary libraries:

import yt_dlp
from youtube_comment_downloader import YoutubeCommentDownloader

We will now loop through different types of YouTube pages and gather the required data, starting with a standard video page. Assign the video link and configure the library options:

video_url = 'https://www.youtube.com/watch?v=bEkZ6H9QY2s'
ydl_opts = {'extract_flat': True}

In this case, we only want the video’s metadata: the 'extract_flat' option keeps the extraction lightweight, and passing download=False to extract_info() below ensures the file itself is not downloaded. If you do want the file, yt-dlp can download it in one of the available formats, as sketched next.
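
Here is a minimal download sketch (the format selector and output filename template are illustrative choices, not requirements):

download_opts = {
    'format': 'best[height<=720]',   # illustrative: best single format up to 720p
    'outtmpl': '%(title)s.%(ext)s',  # template for the output filename
}
with yt_dlp.YoutubeDL(download_opts) as ydl:
    ydl.download([video_url])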

Now, let’s extract the video information and specify the parameters we want to display: 

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info_dict = ydl.extract_info(video_url, download=False)
    video_info = {
        "title": info_dict.get('title', 'N/A'),
        "uploader": info_dict.get('uploader', 'N/A'),
        "upload_date": info_dict.get('upload_date', 'N/A'),
        "views": info_dict.get('view_count', 'N/A'),
        "likes": info_dict.get('like_count', 'N/A'),
        "dislikes": info_dict.get('dislike_count', 'N/A'),
        "duration": info_dict.get('duration', 'N/A'),
        "description": info_dict.get('description', 'N/A'),
        "tags": info_dict.get('tags', [])
    }

Optionally, you can print the results to the console:

print(video_info)

You will receive data structured similarly to this:

YouTube video yt-dlp example

Next, let’s retrieve data about a channel. We will use similar code but specify the channel URL and desired parameters:

channel_url = 'https://www.youtube.com/c/Google'


with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info_dict = ydl.extract_info(channel_url, download=False)
    channel_info = {
        "channel_name": info_dict.get('uploader', 'N/A'),
        "channel_id": info_dict.get('channel_id', 'N/A'),
        "description": info_dict.get('description', 'N/A'),
        "channel_url": info_dict.get('channel_url', 'N/A'),
        "channel_views": info_dict.get('view_count', 'N/A'),
        "subscriber_count": info_dict.get('subscriber_count', 'N/A'),
        "total_videos": info_dict.get('playlist_count', 'N/A')
    }


print(channel_info)

Your results will be channel data similar to the following:

YouTube channel yt-dlp example

Now, we will change the URL and parameters to scrape YouTube playlist data:

playlist_url = 'https://www.youtube.com/playlist?list=UUK8sQmJBp8GCxrOtXWBpyEA'


with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info_dict = ydl.extract_info(playlist_url, download=False)
    playlist_info = {
        "playlist_title": info_dict.get('title', 'N/A'),
        "playlist_id": info_dict.get('id', 'N/A'),
        "playlist_url": info_dict.get('webpage_url', 'N/A'),
        "total_videos": len(info_dict.get('entries', []))
    }


print(playlist_info)

The output will resemble this:

YouTube playlist yt-dlp example

Finally, let’s scrape YouTube search results data:

query = 'Coffee'
search_url = f'ytsearch10:{query}'


with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    search_results = ydl.extract_info(search_url, download=False)
    videos = []


    for entry in search_results['entries']:
        videos.append({
            "title": entry.get('title', 'N/A'),
            "url": entry.get('url', 'N/A'),
            "duration": entry.get('duration', 'N/A'),
            "views": entry.get('view_count', 'N/A')
        })


print(videos)

The output will look like this:

YouTube search results yt-dlp example

Lastly, we’ll scrape YouTube comments for specific videos. As mentioned, we will use the YoutubeCommentDownloader library that we imported earlier. Make sure to assign the “video_url”:

video_url = 'https://www.youtube.com/watch?v=bEkZ6H9QY2s'

Now, let’s obtain the list of comments for this video:

downloader = YoutubeCommentDownloader()
comments = []


for comment in downloader.get_comments_from_url(video_url):
    comments.append({
        "author": comment['author'],
        "text": comment['text'],
        "likes": comment['votes'],
        "time": comment['time']
    })

Finally, print the first five comments:

print(comments[:5])

You will receive a list like this:

YouTube comments scraping example

To explore the available variables, methods, and parameters in more detail, please refer to the official documentation.

Scraping YouTube Data Using Selenium

Using Selenium for scraping YouTube is the most complex option that we will be discussing. The general process will be the same regardless of the page you want to scrape. The steps are as follows:

  1. Initialize the web driver and navigate to the desired page.

  2. Retrieve the HTML source of the page.

  3. Parse the page and extract data using CSS selectors.

  4. Output or save the data, then close the web driver.

The process is the same for every page type; only the page URL and CSS selectors differ. In the following example, we will focus on the video page in detail. Keep in mind that this approach is far less popular than using the YouTube Data API or yt-dlp.

Additionally, please note that Google Colaboratory does not support executing scripts that utilize a web driver, so running this script in the cloud will not be possible.

First, let’s revisit the video page and carefully examine the elements we want to scrape:

How to find CSS selectors of elements

Let’s identify the selectors for the main elements on the page:

  1. Title: h1.ytd-watch-metadata

  2. Likes: div.yt-spec-touch-feedback-shape

  3. Channel Name: div.ytd-channel-name

  4. Subscribers: #owner-sub-count

  5. Views Info: #ytd-watch-info-text

  6. Description: #description-inline-expander

  7. Comments: ytd-comment-thread-renderer

    1. Comment Author: #author-text span

    2. Comment Content: #content-text

    3. Comment Timestamp: #published-time-text a

    4. Comment Link: #published-time-text a (href attribute)

If you have not worked with Selenium before, you can find detailed information and examples in our introductory article about using Selenium with Python.

First, let’s import the necessary libraries and modules:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
import time

Next, we’ll configure the web driver and navigate to the page. We will add a delay to allow the page to load:

chrome_options = Options()
driver = webdriver.Chrome(options=chrome_options)


url = "https://www.youtube.com/watch?v=GSJrfo3veMg"
driver.get(url)


time.sleep(10)
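
A fixed sleep either wastes time or fails on slow connections. As an optional refinement, you could use Selenium’s explicit waits to pause only until a specific element appears (a sketch using the title selector identified above):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the video title to appear, then continue immediately
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h1.ytd-watch-metadata'))
)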

We can either extract the necessary data directly or encapsulate the extraction logic in a separate function to reduce code duplication:

def safe_find_element(by, value):
    try:
        return driver.find_element(by, value)
    except NoSuchElementException:
        return None

Let’s define the data extraction using the previously mentioned selectors:

title = safe_find_element(By.CSS_SELECTOR, 'h1.ytd-watch-metadata')
title_text = title.text if title else "Title not found"


likes = safe_find_element(By.CSS_SELECTOR, 'div.yt-spec-touch-feedback-shape')
likes_text = likes.text if likes else "Likes not found"


channel_info = safe_find_element(By.CSS_SELECTOR, 'div.ytd-channel-name')
channel_name = channel_info.text if channel_info else "Channel name not found"


subscribers = safe_find_element(By.ID, 'owner-sub-count')
subscribers_text = subscribers.text.split(' ')[0] if subscribers else "Subscribers not found"


views_info = safe_find_element(By.ID, 'ytd-watch-info-text')
views_text = views_info.text.split(' views')[0].strip() if views_info else "Views not found"
date_text = views_info.text.split(' views')[1].strip() if views_info and ' views' in views_info.text else "Date not found"


description = safe_find_element(By.ID, 'description-inline-expander')
description_text = description.text if description else "Description not found"


comments_data = []
comments = driver.find_elements(By.CSS_SELECTOR, 'ytd-comment-thread-renderer')
for comment in comments:
    author = comment.find_element(By.CSS_SELECTOR, '#author-text span').text
    content = comment.find_element(By.CSS_SELECTOR, '#content-text').text
    timestamp = comment.find_element(By.CSS_SELECTOR, '#published-time-text a').text
    comment_link = comment.find_element(By.CSS_SELECTOR, '#published-time-text a').get_attribute('href')


    comments_data.append({
        'author': author,
        'content': content,
        'timestamp': timestamp,
        'link': comment_link
    })

Finally, we can print the extracted data and close the web driver:

print(f"Title: {title_text}")
print(f"Likes: {likes_text}")
print(f"Channel: {channel_name}")
print(f"Subscribers: {' '.join(subscribers_text)}")
print(f"Views: {views_views_text}")
print(f"Date: {views_date_text}")
print(f"Description: {description_text}")
print("Comments:")
for comment in comments_data:
    print(f"Author: {comment['author']}, Comment: {comment['content']}, Posted: {comment['timestamp']}")


driver.quit()

As a result, we will receive the data in the following format:

YouTube video scraping example

If needed, you can also add additional variables for any other selectors you may require or scrape other YouTube pages using the same algorithm.

Conclusion and Takeaways

Scraping data from YouTube is a useful way to gather information for analysis, trend tracking, and generating new ideas for your own videos. There are various approaches to data extraction, ranging from manual collection to using the official API, extensions, and libraries such as Selenium. Each method has its advantages and disadvantages, which are important to consider when choosing the most suitable option for your needs.

The choice of scraping method depends on your goals, the volume of data, and your programming preferences. Choosing between the API, yt-dlp, and Selenium is like picking camping gear for different trips. The API is like a tent in a box — it’s quick to set up and does the job. yt-dlp is a full camping kit, with tools and gadgets for multiple setups, and Selenium is like bringing your own logs, hammer, and tarp; you’ll have to build the campsite yourself! Both yt-dlp and Selenium are versatile tools, but they come with unique challenges.

With the yt-dlp library, you will not be able to extract in-depth information such as video comments, mentions, and trends. CAPTCHAs may also become an issue with this form of scraping.

With Selenium, you will need to carefully choose the proper selectors, and keeping them current as YouTube’s layout changes can prove overly complex in practice.

Given this, we recommend using the YouTube Data API. It is the most reliable and secure way to obtain information about channels, videos, playlists, and comments. Additionally, since it is the official API, this method is sanctioned by the platform.

For your convenience, these examples are also available in Google Colaboratory, allowing you to run them in the cloud without needing to install Python on your PC.

We hope that the examples and recommendations provided assist you in successfully extracting data from YouTube!
