How to Scrape LinkedIn with Python
Last edit: May 30, 2024

LinkedIn is the world's largest professional social networking site, with 1 billion members in more than 200 countries and territories. It is a valuable resource for both businesses and individuals, as it provides a platform for networking, finding jobs, and learning about new industries.

In this article, we'll delve into LinkedIn data scraping and explore the methods and tools for extracting this valuable information. We'll provide step-by-step guides and ready-to-use tools in Google Colaboratory, empowering you to harness the power of LinkedIn data for your specific needs.


LinkedIn Structure and Data Objects

As we said before, LinkedIn is a professional networking platform where individuals can create profiles showcasing their skills, experiences, and professional accomplishments. It's widely used by job seekers, recruiters, and professionals looking to connect with others in their field.

Before delving into the data collection process, let's explore LinkedIn in more detail and identify the information we can extract. Here are some of the primary data points available on LinkedIn:

  1. User Profiles: These profiles provide comprehensive information about users, including their name, location, current and past work experiences, educational background, skills, and endorsements.
  2. Job Listings: LinkedIn's job search feature offers advanced filters to narrow down results to the most relevant opportunities. Data extractable from job listings includes company details, location, job descriptions, and specific requirements for candidates.
  3. LinkedIn Learning: This platform offers many online courses and video tutorials. Extractable data includes course titles, descriptions, instructors, and links to course materials.
  4. LinkedIn Articles: The Collaborative Articles section features articles written by users and experts, categorized by topic or author. Data extraction can capture article titles, authors, content summaries, and publication dates.

Scraping various platform elements provides quick access to valuable data about users and companies. User data can assist in identifying qualified and experienced professionals for recruitment purposes, and company data can aid in finding potential clients, partners, or future employment opportunities.

Types of LinkedIn Scrapers

Before scraping data, it's crucial to determine the appropriate method for data acquisition. Based on the chosen approach, there are two primary types of LinkedIn scrapers:

  1. Proxy-based LinkedIn scrapers. These scrapers utilize a pool of proxies to mask your IP address and avoid detection by LinkedIn. This method allows for large-scale data extraction without risking account suspension. However, because it doesn't authenticate, it can only reach data that LinkedIn exposes publicly.
  2. Cookie-based LinkedIn scrapers. These scrapers leverage your existing LinkedIn session cookies to mimic your browsing activity. This method offers targeted data collection and access to personalized information. However, it relies on your account, which could lead to an account suspension if LinkedIn detects suspicious activity.

In this article, we'll delve into the details of each method, providing code examples to illustrate their implementation. We'll also discuss the advantages and limitations of each approach to help you choose the most suitable one for your specific scraping needs.

Prerequisites

To follow the examples in this article, you will need Python 3.10 or higher; some examples (the API and the cookie- or profile-based ones) also require a LinkedIn account. A Python IDE is recommended for a more streamlined experience, but any code editor with syntax highlighting and Python installed will suffice.

Using a virtual environment is optional but recommended, as it keeps your project's dependencies isolated. We've already covered how to set one up and use it in the Python scraping article.

We'll employ several libraries in this article. Install them using the package manager (the json and csv modules we also use ship with Python, so they don't need to be installed):

pip install beautifulsoup4 requests selenium requests_oauthlib

You may also need a web driver to use Selenium. The Selenium scraping article provides instructions on where to find and use it. 

Ways to Extract Data from LinkedIn

Let's explore the different methods for extracting data from LinkedIn. With the exception of the LinkedIn API, the scraping process is the same for every method; the only difference lies in the initial script that obtains the page's source code.

Using the LinkedIn API

Like many other platforms with large datasets, LinkedIn offers an official API for data access. However, this method has both advantages and disadvantages.

On the plus side, the API gives you quick, structured access to the data it exposes. However, working with the LinkedIn API can be more challenging than initially anticipated. The first hurdle is obtaining an access token, which can be quite intricate without the necessary expertise.

Furthermore, LinkedIn imposes daily request limits on API calls. This can lead to scalability issues for applications that frequently access the API.

As for the data itself, not everything on LinkedIn is extractable via the API. For instance, accessing the list of available job openings is not permitted. Moreover, as an individual developer, you'll be required to specify a test company during app creation, and using such a company restricts access to certain endpoints.

If you still decide to try this method, proceed to the developer page and create your application. To obtain the required keys, you'll need to fill out a form, after which you can find your personal "Client ID" and "Client Secret" in the "Auth" section.

Create a new Python project, import the necessary libraries, and provide the obtained credentials: 

from requests_oauthlib import OAuth2Session

cl_id = 'PUT-YOUR-CLIENT-ID'
cl_secret = 'PUT-YOUR-CLIENT-SECRET'

We will then create variables to store the necessary links for obtaining tokens and authorization:

redirect_url = 'http://localhost:8080/callback'
base_url = 'https://www.linkedin.com/oauth/v2/authorization'
token_url = 'https://www.linkedin.com/oauth/v2/accessToken'

Open the session:

linkedin = OAuth2Session(cl_id, redirect_uri=redirect_url)

Generate an authorization link and display it on the screen:

authorization_url, state = linkedin.authorization_url(base_url)
print('Go to:', authorization_url)

Next, follow the link in your browser manually and authorize the application. Alternatively, you can integrate Selenium to automate this step. After authorization, the browser is redirected to the callback URL; paste that full URL back into the script:

response = input('Put full URL: ')
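
If you would rather not paste the URL by hand, here is a rough sketch of automating this step with Selenium. It assumes you log in and approve the app manually in the opened browser window, and that the redirect simply lands on the localhost callback URL (no local server is needed, since we only want the URL itself):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get(authorization_url)

# Log in and approve the app manually in the opened window.
# Wait (up to 5 minutes) until the browser is redirected to the callback URL.
WebDriverWait(driver, 300).until(lambda d: 'callback' in d.current_url)

response = driver.current_url  # the full redirect URL containing the authorization code
driver.quit()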

Then get the token:

linkedin.fetch_token(token_url, client_secret=cl_secret, authorization_response=response)

With the token you've received, you can now access the data you need using the LinkedIn API. The comprehensive LinkedIn API documentation provides detailed information about the various endpoints available.
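
For example, once the token has been fetched, the same session object can be used to call an endpoint directly. Which endpoints are available depends on the products and scopes enabled for your app; the snippet below uses the OpenID Connect userinfo endpoint purely as an illustration:

# Requires the OpenID Connect scopes to be enabled for your app
profile = linkedin.get('https://api.linkedin.com/v2/userinfo')
print(profile.status_code)
print(profile.json())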

Using Requests and BeautifulSoup

Another option for obtaining data is to use the regular Requests library together with proxies or cookies. As we mentioned earlier, as long as LinkedIn doesn't find your activity suspicious, you can get any publicly available data without authorization. However, in this case, the risk of being blocked is high.

To reduce that risk, you can constantly rotate proxies so that your real IP address never gets blocked, or you can use your real cookies and headers to make your requests look less suspicious. Let's consider both options, starting with proxies.

We have already written in detail about how to use proxies with the Requests library in Python, so we won't dwell on it here and will simply give a ready-made example of a request made through a proxy:

import requests

url = "https://www.linkedin.com/jobs/search?position=1&pageNum=0"

# Replace these with working proxies of your own
proxies = {
    'http': 'http://45.95.147.106:8080',
    'https': 'https://37.187.17.89:3128'
}

response = requests.get(url, proxies=proxies)

You can set up proxy rotation yourself (a minimal sketch follows below) or use a rotating proxy service. You can find a list of reliable providers in our top 10 residential proxy providers.
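
A do-it-yourself rotation can be as simple as retrying the request through different proxies from a small pool; the addresses below are placeholders to be replaced with your own:

import random
import requests

# Example pool of proxies (placeholders - substitute your own)
proxy_pool = [
    'http://45.95.147.106:8080',
    'http://37.187.17.89:3128',
]

url = "https://www.linkedin.com/jobs/search?position=1&pageNum=0"

response = None
for attempt in range(3):
    proxy = random.choice(proxy_pool)
    try:
        response = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        if response.status_code == 200:
            break
    except requests.RequestException:
        continue  # this proxy failed, try another one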

If you want to use your actual cookies, go to LinkedIn, log in, and open DevTools (F12, or right-click and choose Inspect). Then go to the Network tab, select any request to linkedin.com, and copy the value of the Cookie header:

Copying cookies from the Network tab in DevTools

In addition to the cookies, you should include several other headers, including a valid User-Agent string. You can use our example, but replace the placeholder values with your own cookies and an up-to-date User-Agent string:

cookies = "YOUR-COOKIES"

headers = {
  'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
  'Connection':'keep-alive',
  'accept-encoding': 'gzip, deflate, br',
  'Referer':'http://www.linkedin.com/',
  'accept-language': 'en-US,en;q=0.9',
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
  'Cookie': cookies
}

When making a request, be sure to include these headers:

response = requests.get(url, headers=headers)

You are free to choose either of these options; both will return the data you need.
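
Either way, it's worth checking that LinkedIn returned the actual page rather than redirecting you to its login wall. A minimal sanity check might look like this (the "authwall" marker is based on LinkedIn's public login wall URL; adjust it to whatever you actually observe):

response = requests.get(url, headers=headers)

# Rough check that we received the real page, not a login/authwall redirect
if response.status_code != 200 or "authwall" in response.url:
    print("Blocked or redirected to login:", response.status_code, response.url)
else:
    print("Page received, length:", len(response.text))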

Using Profiles in Selenium 

Another option is to use Selenium profiles. You can use an existing browser profile where you are logged into LinkedIn or create a new profile and automate the login process using the Selenium library functionality.

You can also simply log in using Selenium without using profiles. However, in this case, you will have to do this each time you run your script.

Therefore, it is easier to log in once in a specific browser profile and then simply collect data: the session persists between script runs. This also lets you log in manually (for example, via Google sign-in) and then reuse the profile with the saved session.

To do this, import the required modules:

from selenium import webdriver
from selenium.webdriver.common.by import By

Specify the profile path and set web driver options:

# Path to the Chrome user data folder and the name of the profile to reuse
user_data_dir = r'C:\Users\Admin\AppData\Local\Google\Chrome\User Data'
profile_name = 'Profile 1'

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'--user-data-dir={user_data_dir}')
chrome_options.add_argument(f'--profile-directory={profile_name}')

Once the options are configured, create the WebDriver object and open the target page:

driver = webdriver.Chrome(options=chrome_options)

driver.get('https://www.linkedin.com/jobs/search?position=1&pageNum=0')

This option is suitable if you want to use a profile you are already logged into on LinkedIn. If you want to log in using a script, you will need to go to a different page:

driver.get("https://linkedin.com/uas/login")

To enter the username and password, let's identify the input fields:

The LinkedIn sign-in form

Then enter the login and password into these fields:

username = driver.find_element(By.ID, "username")
username.send_keys("PUT-YOUR-LOGIN")  

password = driver.find_element(By.ID, "password")
password.send_keys("PUT-YOUR-PASSWORD") 

Confirm the data:

driver.find_element(By.XPATH, "//button[@type='submit']").click()

This authorization is saved in the current browser profile, so on subsequent runs you can skip this step and scrape right away.
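
From here, the flow is the same as with Requests: let the page load, take the rendered HTML from Selenium, and pass it to BeautifulSoup. A minimal sketch (using a crude fixed wait; WebDriverWait with an expected condition is more robust):

import time
from bs4 import BeautifulSoup

driver.get('https://www.linkedin.com/jobs/search?position=1&pageNum=0')
time.sleep(5)  # wait for the page to render

soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title.get_text(strip=True) if soup.title else 'No title found')

driver.quit()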

Using a Web Scraping API

The simplest and safest way to scrape data from LinkedIn is to use a web scraping API. It collects the necessary data for you and returns only the final result. This lets you avoid managing proxies and worrying about blocks, since requests to LinkedIn are made by the API service rather than from your machine.


Let's consider this option using HasData's web scraping API as an example. To use it, log in to our website and copy your personal API key, which can be found in your account.

Next, we will create a new project and import the necessary libraries:

import requests
import json

Provide a link to the job listings page and the API key you just copied:

ln_url = "https://www.linkedin.com/jobs/search?position=1&pageNum=0"
api_key = "PUT-YOUR-API-KEY"

Define an API endpoint and request parameters, including the type of proxies to use:

url = "https://api.hasdata.com/scrape/web"

payload = json.dumps({
  "url": ln_url,
  "proxyCountry": "US",
  "proxyType": "datacenter",
  "blockResources": True,
  "blockAds": True,
  "screenshot": True,
  "jsRendering": True,
  "extractEmails": True
})

headers = {
  'Content-Type': 'application/json',
  'x-api-key': api_key
}

Make the request:

response = requests.request("POST", url, headers=headers, data=payload)

As a result, HasData's web scraping API will return all data, including headers, content, and a screenshot of the page. 
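
The page HTML itself is returned under the content key (we use it later in this article); other field names may differ between API versions, so the quickest way to see what you received is to inspect the JSON:

data = response.json()

# The page HTML is returned under the "content" key
html = data.get('content', '')
print("HTML length:", len(html))

# Print the top-level keys to see which other fields (screenshot, headers, etc.) are available
print("Top-level keys:", list(data.keys()))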

Scrape Data from LinkedIn

Let's explore the process of scraping different LinkedIn pages using examples. The general approach remains the same, except for page URLs and structure.

We'll utilize HasData's web scraping API to retrieve pages, eliminating the need for proxies and avoiding blocking issues. Furthermore, we'll employ the BeautifulSoup library to parse the extracted page code. You can use any previously discussed methods to obtain the page's HTML code, and the processing and parsing steps will remain the same. 

LinkedIn Job Listings Scraping

Let's start by scraping data from the job listings. You can find a ready-made scraper script in Google Colaboratory. Let's go to the job search page and see what data we can extract.

LinkedIn job listings page

Let's break down the filters we can use:

  1. f_SB2: Salary level from 1 to 5, starting at $40k with a $20k increment.
  2. f_E: Experience level from 1 to 5 (e.g., 1 = Internship), multiple selections allowed.
  3. f_TPR: Time period. If empty, vacancies are displayed for all time.
  4. location: Location. Country or city can be specified.
  5. keywords: Keywords to search for vacancies.
  6. f_JT: Job type. The first letter is used as a parameter, e.g., Full-Time as F or Part-time as P. Multiple options can be specified.
  7. position: Position number of the vacancy for which details are displayed.
  8. pageNum: Search page number.

Scraping the search results page alone, without opening each vacancy, lets us extract the job title, company, location, link to the vacancy, and posting date.

Create a new script and get this data. To start, we'll import the necessary libraries:

import requests
import json
from bs4 import BeautifulSoup
import csv

Then set job settings and generate the link:

f_SB2 = 1  # Salary level 1 ($40k+)
f_E = 1   # Experience level 1 (e.g., Internship)
f_TPR = ""  # Vacancies displayed for all time periods
location = "United States"  # Country or city
keywords = "Data Scientist"  # Keywords for search
f_JT = "F"  # Employment type - Full-time
position = 1  # Job position number for details
pageNum = 0  # Page number of search results

# Constructing the URL
ln_url = f"https://www.linkedin.com/jobs/search?f_SB2={f_SB2}&f_E={f_E}&f_TPR={f_TPR}&location={location}&keywords={keywords}&f_JT={f_JT}&position={position}&pageNum={pageNum}"

Set your HasData API key:

api_key = "PUT-YOUR-API-KEY"

Get the LinkedIn job listings page using the web scraping API:

url = "https://api.hasdata.com/scrape/web"

payload = json.dumps({
  "url": ln_url,
  "proxyCountry": "US",
  "proxyType": "datacenter",
  "blockResources": True,
  "blockAds": True,
  "screenshot": True,
  "jsRendering": True,
  "extractEmails": True
})
headers = {
  'Content-Type': 'application/json',
  'x-api-key': api_key
}
response = requests.request("POST", url, headers=headers, data=payload)

job_content = response.json()['content']

Parse the HTML code of the page:

soup = BeautifulSoup(job_content, 'html.parser')

Get data for every job:

job_data = []
job_list = soup.find('ul', class_='jobs-search__results-list')
if job_list:
    jobs = job_list.find_all('li')

    for job in jobs:
        job_title = job.find('h3', class_='base-search-card__title').get_text(strip=True) if job.find('h3', class_='base-search-card__title') else '-'
        company = job.find('h4', class_='base-search-card__subtitle').get_text(strip=True) if job.find('h4', class_='base-search-card__subtitle') else '-'
        location = job.find('span', class_='job-search-card__location').get_text(strip=True) if job.find('span', class_='job-search-card__location') else '-'
        job_link = job.find('a')['href'] if job.find('a') else '-'
        posted_date = job.find('time', class_='job-search-card__listdate')['datetime'] if job.find('time', class_='job-search-card__listdate') else '-'
       
        job_info = {
            'job_title': job_title,
            'company': company,
            'location': location,
            'job_link': job_link,
            'posted_date': posted_date
        }
      
        job_data.append(job_info)

Print the result on the screen:

if job_data:
    for job in job_data:
        print("Job Title:", job['job_title'])
        print("Company:", job['company'])
        print("Location:", job['location'])
        print("Job Link:", job['job_link'])
        print("Posted Date:", job['posted_date'])
        print("-" * 50)

Or save this data to CSV:

    with open("job_data.csv", 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['job_title', 'company', 'location', 'job_link', 'posted_date']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
     
        writer.writeheader()
        for job in job_data:
            writer.writerow(job)

As a result we will get:

The resulting CSV file with job listings

After running the script, we get data on 60 vacancies in a convenient format. Unfortunately, the search page doesn't contain full job descriptions; to get detailed data, you have to open every vacancy individually, which requires many additional requests. A rough sketch of such a crawl is shown below.
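
If you do need the details, you can feed each job_link back through the same web scraping API and parse the individual job page. Here is a rough sketch, limited to the first few links; the description selector is an assumption and should be verified in DevTools before use:

for job in job_data[:3]:  # limit the number of requests while testing
    detail_payload = json.dumps({
        "url": job['job_link'],
        "proxyType": "datacenter",
        "jsRendering": True
    })
    detail_resp = requests.post(url, headers=headers, data=detail_payload)
    detail_soup = BeautifulSoup(detail_resp.json().get('content', ''), 'html.parser')

    # The class name below is an assumption - inspect the job page to confirm the selector
    description = detail_soup.find('div', class_='show-more-less-html__markup')
    job['description'] = description.get_text(strip=True) if description else '-'
    print(job['job_title'], '-', len(job['description']), 'characters of description')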

Scrape LinkedIn Learning

LinkedIn Learning offers a vast catalog of courses and video tutorials on diverse topics. As before, a ready-to-use script is available in Google Colaboratory.

Let's navigate the LinkedIn Learning page and explore the available filters and data extraction possibilities.

The LinkedIn Learning search page

This page has fewer filters. Let's take a closer look at them:

  1. sortBy: The method for sorting results. For example, "RELEVANCE" for sorting by relevance.
  2. difficultyLevel: The difficulty level of the course. For example, "BEGINNER" for beginner level.
  3. entityType: The type of entity that will be returned in the search results. In this case, this is "COURSE" for courses.
  4. durationV2: The duration of the course. For example, "BETWEEN_0_TO_10_MIN" for courses lasting from 0 to 10 minutes.
  5. softwareNames: The names of the software that the courses are related to. For example, "Power+Platform" for courses related to Power Platform.

On this page, we can extract the course title and link, its author, and the type of study material. Let's take our previous code as a basis and replace the parameters used to generate the link:

sortBy = "RELEVANCE"
difficultyLevel = "BEGINNER"
entityType = "COURSE"
durationV2 = ""
softwareNames = ""

# Constructing the URL
ln_url = f"https://www.linkedin.com/learning/search?trk=content-hub-home-page_guest_nav_menu_learning&sortBy={sortBy}&difficultyLevel={difficultyLevel}&entityType={entityType}&durationV2={durationV2}&softwareNames={softwareNames}"

And selectors for parsing the page:

learn_data = []
learn_list = soup.find('ul', class_='results-list')
if learn_list:
    learns = learn_list.find_all('li')

    for learn in learns:
        title = learn.find('h3', class_='base-search-card__title').text.strip() if learn.find('h3', class_='base-search-card__title') else '-'
        subtitle = learn.find('h4', class_='base-search-card__subtitle').text.strip() if learn.find('h4', class_='base-search-card__subtitle') else '-'
        identifier = learn.find('p', class_='base-search-card__identifier').text.strip() if learn.find('p', class_='base-search-card__identifier') else '-'
        learn_link = learn.find('a')['href'] if learn.find('a') else '-'

        learn_info = {
            'title': title,
            'subtitle': subtitle,
            'identifier': identifier,
            'learn_link': learn_link,
        }

        learn_data.append(learn_info)

The remaining code will stay the same. This will generate a file containing a list of available courses:

The resulting file with LinkedIn Learning courses

We received a file containing 50 beginner-friendly learning materials, including courses and video tutorials, ranging from 0 to 10 minutes in length. 

Scrape LinkedIn Articles

Another section you may want to scrape is LinkedIn Articles. A ready-made script can be found in Google Colaboratory.

Let's move to the LinkedIn Articles page and take a closer look at it:

The LinkedIn Articles page

Unlike previous examples, there are no filters that can be customized here. However, you can choose any category or subcategory from the ones on the right. As an example, we will use a link to the root section:

ln_url = f"https://www.linkedin.com/pulse/topics/home/"

And change selectors:

article_data = []
article_list = soup.find('div', class_='content-hub-home-core-rail')
if article_list:
    articles = article_list.find_all('div', class_='content-hub-entity-card-redesign')
   
    for article in articles:
        title = article.find('h2').text.strip() if article.find('h2') else '-'
        description = article.find('p', class_='content-description').text.strip() if article.find('p', class_='content-description') else '-'
        contributions = article.find('span').text.strip() if article.find('span') else '-'
        timestamp = article.find_all('span')[-1].text.strip() if article.find('span') else '-'
        article_link = article.find('a')['href'] if article.find('a') else '-'

        article_info = {
            'title': title,
            'description': description,
            'contributions': contributions,
            'timestamp': timestamp,
            'article_link': article_link,
        }
   
        article_data.append(article_info)

Other parts of the script will remain the same as in the previous examples, producing the following CSV file:

The resulting file with LinkedIn articles

We obtained 100 article links related to the specified topic in total. If necessary, the script can be refined to gather more detailed data by crawling through all links in a queue. This will allow us to retrieve article content, author information, links to their profiles, and discussion participants.

Conclusion

In this article, we explored various methods for accessing data on LinkedIn and examined different approaches to implementing scrapers. Additionally, we provided ready-to-use code examples and uploaded them to Google Colaboratory for convenient access and cloud-based execution.

As a result, we developed several handy tools that facilitate extracting personalized data from LinkedIn, regardless of whether the data is being retrieved from a course page or a job listing page. To exemplify and simplify data collection, we utilized HasData's API, which enables data gathering without the risk of getting blocked by LinkedIn and without the need to log in to one's profile.

Valentina Skakun

I'm a technical writer who believes that data parsing can help in collecting and analyzing data. I write about what parsing is and how to use it.