How to Scrape LinkedIn with Python
LinkedIn is the world’s largest professional social networking site, with 1 billion members in more than 200 countries and territories worldwide. It is a valuable resource for both businesses and individuals, as it provides a platform for networking, finding jobs, and learning about new industries.
In this article, we’ll delve into LinkedIn data scraping and explore the methods and tools for extracting this valuable information. We’ll provide step-by-step guides and ready-to-use tools in Google Colaboratory, empowering you to harness the power of LinkedIn data for your specific needs.
LinkedIn Structure and Data Objects
As we said before, LinkedIn is a professional networking platform where individuals can create profiles showcasing their skills, experiences, and professional accomplishments. It’s widely used by job seekers, recruiters, and professionals looking to connect with others in their field.
Before delving into the data collection process, let’s explore LinkedIn in more detail and identify the information we can extract. Here are some of the primary data points available on LinkedIn:
User Profiles: These profiles provide comprehensive information about users, including their name, location, current and past work experiences, educational background, skills, and endorsements.
Job Listings: LinkedIn’s job search feature offers advanced filters to narrow down results to the most relevant opportunities. Data extractable from job listings includes company details, location, job descriptions, and specific requirements for candidates.
LinkedIn Learning: This platform offers many online courses and video tutorials. Extractable data includes course titles, descriptions, instructors, and links to course materials.
LinkedIn Articles: The Collaborative Articles section features articles written by users and experts, categorized by topic or author. Data extraction can capture article titles, authors, content summaries, and publication dates.
Scraping various platform elements provides quick access to valuable data about users and companies. User data can assist in identifying qualified and experienced professionals for recruitment purposes, and company data can aid in finding potential clients, partners, or future employment opportunities.
Types of LinkedIn scrapers
Before scraping data, it’s crucial to determine the appropriate method for data acquisition. Based on the chosen approach, there are two primary types of LinkedIn scrapers:
Proxy-based LinkedIn scrapers. These scrapers utilize a pool of proxies to mask your IP address and avoid detection by LinkedIn. This method allows for large-scale data extraction without risking account suspension. However, it may be unable to access specific data due to LinkedIn limitations.
Cookie-based LinkedIn scrapers. These scrapers leverage your existing LinkedIn session cookies to mimic your browsing activity. This method offers targeted data collection and access to personalized information. However, it relies on your account, which could lead to an account suspension if LinkedIn detects suspicious activity.
In this article, we’ll delve into the details of each method, providing code examples to illustrate their implementation. We’ll also discuss the advantages and limitations of each approach to help you choose the most suitable one for your specific scraping needs.
Prerequisites
To use the examples in this article, you will need Python 3.10 or higher, as well as a LinkedIn account for the methods that require authorization. A Python IDE is recommended for a more streamlined experience, but any code editor with syntax highlighting and Python installed will suffice.
A virtual environment is optional but helps keep your scraping project’s dependencies isolated. We’ve already covered how to set up and use one in the Python scraping article.
We’ll employ several libraries in this article (the json module ships with Python’s standard library, so it doesn’t need to be installed). Install the rest using the package manager:
pip install beautifulsoup4 requests selenium requests_oauthlib
You may also need a web driver to use Selenium. The Selenium scraping article provides instructions on where to find and use it.
Ways to Extract Data from LinkedIn
Let’s explore the different methods for extracting data from LinkedIn. Regardless of the chosen method, the scraping process remains the same, except when using the LinkedIn API; the only difference lies in the initial script for obtaining the page’s source code.
Using LinkedIn API
Like many other platforms with large datasets, LinkedIn offers its API for data access. However, this method has both advantages and disadvantages.
For instance, using the official API guarantees easy and quick access to the necessary data in a convenient format. However, working with the LinkedIn API can be more challenging than initially anticipated. The first hurdle is setting up a server to generate a token, which can be quite intricate without the necessary expertise.
Furthermore, LinkedIn imposes daily request limits on API calls. This can lead to scalability issues for applications that frequently access the API.
In terms of the data retrievable through the API, not all LinkedIn data is extractable via the API. For instance, accessing the list of available job openings is not permitted. Moreover, as an individual developer, you’ll be required to specify a test company during app creation, and using such a company restricts access to certain endpoints.
If you still decide to try this method, proceed to the developer page and create your application. To obtain the required keys, you’ll need to fill out a form, after which you can find your personal “Client ID” and “Client Secret” in the “Auth” section.
Create a new Python project, import the necessary libraries, and provide the obtained credentials:
from requests_oauthlib import OAuth2Session
cl_id = 'PUT-YOUR-CLIENT-ID'
cl_secret = 'PUT-YOUR-CLIENT-SECRET'
We will then create variables to store the necessary links for obtaining tokens and authorization:
redirect_url = 'http://localhost:8080/callback'
base_url = 'https://www.linkedin.com/oauth/v2/authorization'
token_url = 'https://www.linkedin.com/oauth/v2/accessToken'
Open the session:
linkedin = OAuth2Session(cl_id, redirect_uri=redirect_url)
Generate an authorization link and display it on the screen:
authorization_url, state = linkedin.authorization_url(base_url)
print('Go to:', authorization_url)
Next, you’ll need to follow the link in your browser manually. Alternatively, you can integrate Selenium to automate the token retrieval process. Once you have the token, return the full link to the script:
response = input('Put full URL: ')
Then get the token:
linkedin.fetch_token(token_url, client_secret=cl_secret, authorization_response=response)
With the token you’ve received, you can now access the data you need using the LinkedIn API. The comprehensive LinkedIn API documentation provides detailed information about the various endpoints available.
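As a quick test, you can call one of the API endpoints with the session you’ve just authorized. Below is a minimal sketch that requests the signed-in member’s profile via the OpenID Connect userinfo endpoint; it assumes your app was granted the openid and profile scopes, so adjust it to the endpoints your app actually has access to:
# The OAuth2Session now holds the access token and attaches it to requests automatically.
# Assumes the "openid" and "profile" scopes were granted during authorization.
r = linkedin.get('https://api.linkedin.com/v2/userinfo')
if r.ok:
    profile = r.json()
    print(profile.get('name'), profile.get('sub'))
else:
    print('Request failed:', r.status_code, r.text)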
Using Requests and BeautifulSoup
Another option for obtaining data is to use a regular request library and proxies or cookies. As we mentioned earlier, as long as LinkedIn doesn’t find your activity suspicious, you can get any publicly available data without authorization. However, in this case, the risk of being blocked is high.
On the other hand, you can constantly change proxies to avoid blocking your real IP address. Or you can use your real cookies and headers to make your requests less suspicious. Let’s consider both options and start with using proxies.
We have already written in detail about how to use proxies with the Requests library in Python, so we won’t dwell on it here and will instead give you a ready-made example of executing a request with a proxy:
import requests

url = "https://www.linkedin.com/jobs/search?position=1&pageNum=0"
proxies = {
    'http': 'http://45.95.147.106:8080',
    'https': 'https://37.187.17.89:3128'
}
response = requests.get(url, proxies=proxies)
You can set up proxy rotation yourself, or use a proxy server. You can find a list of reliable proxy providers in our top 10 residential proxy providers.
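If you decide to handle rotation yourself, a minimal sketch might look like the following; the proxy addresses are placeholders, so replace them with servers from your own pool:
import random
import requests

# Hypothetical proxy pool; substitute your own addresses.
proxy_pool = [
    'http://45.95.147.106:8080',
    'http://37.187.17.89:3128',
]

url = "https://www.linkedin.com/jobs/search?position=1&pageNum=0"

for attempt in range(3):
    proxy = random.choice(proxy_pool)
    try:
        response = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        if response.status_code == 200:
            break  # Page retrieved successfully, stop retrying
    except requests.RequestException:
        continue  # This proxy failed, try another one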
If you want to use your actual cookies, go to the LinkedIn page, log in, and open DevTools (F12, or right-click and select Inspect). Then go to the Network tab, pick any request to linkedin.com, and copy the Cookie value from its request headers.
In addition to the cookies, you should include a few other headers, most importantly a valid User-Agent string. You can use our example, but replace the placeholder values with your own cookies and an up-to-date User-Agent string before using it.
cookies = "YOUR-COOKIES"
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Connection': 'keep-alive',
    'accept-encoding': 'gzip, deflate, br',
    'Referer': 'http://www.linkedin.com/',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Cookie': cookies
}
When making a request, be sure to include these headers:
response = requests.get(url, headers=headers)
You are free to choose either of the options under consideration, as both will receive the necessary data.
Using Profiles in Selenium
Another option is to use Selenium profiles. You can use an existing browser profile where you are logged into LinkedIn or create a new profile and automate the login process using the Selenium library functionality.
You can also simply log in using Selenium without using profiles. However, in this case, you will have to do this each time you run your script.
Therefore, it is easier to log in to a specific profile once and then simply collect data. In this case, the authorization will be saved after restarting the script. It also allows you to manually log in to your account using Google and then use the profile with the completed authorization.
To do this, import the library:
from selenium import webdriver
Specify the profile path and set web driver options:
profile_path = r'C:\Users\Admin\AppData\Local\Google\Chrome\User Data\Profile 1'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'--user-data-dir={profile_path}')
Once the necessary libraries have been imported and the browser driver has been initialized, you can create the WebDriver object:
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.linkedin.com/jobs/search?position=1&pageNum=0')
This option is suitable if you want to use a profile you are already logged into on LinkedIn. If you want to log in using a script, you will need to go to a different page:
driver.get("https://linkedin.com/uas/login")
To enter the username and password, identify the login form’s input fields and insert your credentials into them:
from selenium.webdriver.common.by import By

username = driver.find_element(By.ID, "username")
username.send_keys("PUT-YOUR-LOGIN")
password = driver.find_element(By.ID, "password")
password.send_keys("PUT-YOUR-PASSWORD")
Confirm the data:
driver.find_element(By.XPATH, "//button[@type='submit']").click()
This authorization will be saved in your current profile, so on subsequent runs you can scrape without repeating this step.
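Putting the pieces together, here is a minimal sketch of a profile-based session that hands the loaded page over to BeautifulSoup for parsing; the profile path is only an example, so adjust it to your system:
from bs4 import BeautifulSoup
from selenium import webdriver

# Example profile path; point it at the Chrome profile you use for LinkedIn.
profile_path = r'C:\Users\Admin\AppData\Local\Google\Chrome\User Data\Profile 1'

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'--user-data-dir={profile_path}')

driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.linkedin.com/jobs/search?position=1&pageNum=0')

# The rendered page can now be parsed the same way as in the following sections.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()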
Use Web Scraping API
The simplest and safest way to scrape data from LinkedIn is to use a web scraping API. This will collect the necessary data and provide only the final results. This allows you to avoid using proxies and worrying about blocking issues, as requests to LinkedIn are not made on the client side but on the side of the API service.
Web Scraping API allows you to scrape web pages without the hassle of managing proxies, headless browsers, and captchas. Simply send the URL and get the HTML response in return.
Let’s consider this option using HasData’s web scraping API as an example. To use it, log in to our website and copy your personal API key, which can be found in your account.
Next, we will create a new project and import the necessary libraries:
import requests
import json
Provide a link to the job listing and the recently copied API key:
ln_url = "https://www.linkedin.com/jobs/search?position=1&pageNum=0"
api_key = "PUT-YOUR-API-KEY"
Define an API endpoint and request parameters, including the type of proxies to use:
url = "https://api.hasdata.com/scrape/web"
payload = json.dumps({
    "url": ln_url,
    "proxyCountry": "US",
    "proxyType": "datacenter",
    "blockResources": True,
    "blockAds": True,
    "screenshot": True,
    "jsRendering": True,
    "extractEmails": True
})
headers = {
    'Content-Type': 'application/json',
    'x-api-key': api_key
}
Make the request:
response = requests.request("POST", url, headers=headers, data=payload)
As a result, HasData’s web scraping API will return all data, including headers, content, and a screenshot of the page.
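The response body is JSON; the rendered HTML is returned in the content field, which is what we’ll parse in the next section. A minimal sketch of reading it (inspect the returned keys to see what else your plan includes, such as a link to the screenshot):
data = response.json()

# The rendered page HTML lives in the "content" field; other fields depend on the request options.
html = data['content']
print(html[:500])  # preview the first 500 characters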
Scrape Data from LinkedIn
Let’s explore the process of scraping different LinkedIn pages using examples. The general approach remains the same, except for page URLs and structure.
We’ll utilize HasData’s web scraping API to retrieve pages, eliminating the need for proxies and avoiding blocking issues. Furthermore, we’ll employ the BeautifulSoup library to parse the extracted page code. You can use any previously discussed methods to obtain the page’s HTML code, and the processing and parsing steps will remain the same.
LinkedIn Job Listings Scraping
Let’s start by scraping data from the job listings. You can find a ready-made scraper script in Google Colaboratory. Let’s go to the job search page and see what data we can extract.
Let’s break down the filters we can use:
f_SB2: Salary level from 1 to 5, starting at $40k with a $20k increment.
f_E: Experience level from 1 to 5; multiple selections allowed.
f_TPR: Time period. If empty, vacancies are displayed for all time.
location: Location. A country or city can be specified.
keywords: Keywords to search for vacancies.
f_JT: Job type. The first letter is used as a parameter, e.g., Full-Time as F or Part-time as P. Multiple options can be specified.
position: Position number of the vacancy for which details are displayed.
pageNum: Search page number.
If we extract data from a single search page without navigating into each vacancy, we can only capture the company, the vacancy title, the location, and the number of responses.
Create a new script and get this data. To start, we’ll import the necessary libraries:
import requests
import json
from bs4 import BeautifulSoup
import csv
Then set job settings and generate the link:
f_SB2 = 1 # Salary level from 40k to $120k
f_E = 1 # Experience level 1 (e.g., Internship)
f_TPR = "" # Vacancies displayed for all time periods
location = "United States" # Country or city
keywords = "Data Scientist" # Keywords for search
f_JT = "F" # Employment type - Full-time
position = 1 # Job position number for details
pageNum = 0 # Page number of search results
# Constructing the URL
ln_url = f"https://www.linkedin.com/jobs/search?f_SB2={f_SB2}&f_E={f_E}&f_TPR={f_TPR}&location={location}&keywords={keywords}&f_JT={f_JT}&position={position}&pageNum={pageNum}"
Set your HasData API key:
api_key = "PUT-YOUR-API-KEY"
Get LinkedIn job listing data using web scraping API:
url = "https://api.hasdata.com/scrape/web"
payload = json.dumps({
    "url": ln_url,
    "proxyCountry": "US",
    "proxyType": "datacenter",
    "blockResources": True,
    "blockAds": True,
    "screenshot": True,
    "jsRendering": True,
    "extractEmails": True
})
headers = {
    'Content-Type': 'application/json',
    'x-api-key': api_key
}
response = requests.request("POST", url, headers=headers, data=payload)
job_content = response.json()['content']
Parse the HTML code of the page:
soup = BeautifulSoup(job_content, 'html.parser')
Get data for every job:
job_list = soup.find('ul', class_='jobs-search__results-list')
job_data = []

if job_list:
    jobs = job_list.find_all('li')
    for job in jobs:
        job_title = job.find('h3', class_='base-search-card__title').get_text(strip=True) if job.find('h3', class_='base-search-card__title') else '-'
        company = job.find('h4', class_='base-search-card__subtitle').get_text(strip=True) if job.find('h4', class_='base-search-card__subtitle') else '-'
        location = job.find('span', class_='job-search-card__location').get_text(strip=True) if job.find('span', class_='job-search-card__location') else '-'
        job_link = job.find('a')['href'] if job.find('a') else '-'
        posted_date = job.find('time', class_='job-search-card__listdate')['datetime'] if job.find('time', class_='job-search-card__listdate') else '-'

        job_info = {
            'job_title': job_title,
            'company': company,
            'location': location,
            'job_link': job_link,
            'posted_date': posted_date
        }
        job_data.append(job_info)
Print the result on the screen:
if job_data:
    for job in job_data:
        print("Job Title:", job['job_title'])
        print("Company:", job['company'])
        print("Location:", job['location'])
        print("Job Link:", job['job_link'])
        print("Posted Date:", job['posted_date'])
        print("-" * 50)
Or save this data to CSV:
with open("job_data.csv", 'w', newline='', encoding='utf-8') as csvfile:
fieldnames = ['job_title', 'company', 'location', 'job_link', 'posted_date']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for job in job_data:
writer.writerow(job)
After running the script, we get data on 60 vacancies in a convenient format. Unfortunately, to obtain detailed data on each vacancy, you need to visit every listing individually, which can be inconvenient, as it requires many additional requests.
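If you do need the full descriptions, you can feed each collected job_link back through the same API and parse the detail pages. A rough sketch, assuming the description sits in a show-more-less-html__markup block (the class name may change, so verify it in DevTools first):
for job in job_data[:5]:  # limit the number of links to keep the request count low
    if job['job_link'] == '-':
        continue
    detail_payload = json.dumps({"url": job['job_link'], "proxyType": "datacenter", "jsRendering": True})
    detail_response = requests.post(url, headers=headers, data=detail_payload)
    detail_soup = BeautifulSoup(detail_response.json()['content'], 'html.parser')

    # Assumed selector for the job description; check the actual markup before relying on it.
    description = detail_soup.find('div', class_='show-more-less-html__markup')
    job['description'] = description.get_text(strip=True) if description else '-'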
Scrape LinkedIn Learning
LinkedIn Learning offers a vast selection of courses and video tutorials on diverse topics. Additionally, a ready-to-use script is available in Google Colaboratory.
Let’s navigate to the LinkedIn Learning page and explore the available filters and data extraction possibilities.
This page has fewer filters. Let’s take a closer look at them:
sortBy: The method for sorting results. For example, “RELEVANCE” for sorting by relevance.
difficultyLevel: The difficulty level of the course. For example, “BEGINNER” for beginner level.
entityType: The type of entity that will be returned in the search results. In this case, this is “COURSE” for courses.
durationV2: The duration of the course. For example, “BETWEEN_0_TO_10_MIN” for courses lasting from 0 to 10 minutes.
softwareNames: The names of the software that the courses are related to. For example, “Power+Platform” for courses related to Power Platform.
On this page, we can extract the title and link to the course, its author and the type of study material. Let’s take our previous code as a basis and replace the parameters for generating the link:
sortBy = "RELEVANCE"
difficultyLevel = "BEGINNER"
entityType = "COURSE"
durationV2 = ""
softwareNames = ""
# Constructing the URL
ln_url = f"https://www.linkedin.com/learning/search?trk=content-hub-home-page_guest_nav_menu_learning&sortBy={sortBy}&difficultyLevel={difficultyLevel}&entityType={entityType}&durationV2={durationV2}&softwareNames={softwareNames}"
And selectors for parsing the page:
learn_list = soup.find('ul', class_='results-list')
learn_data = []

if learn_list:
    learns = learn_list.find_all('li')
    for learn in learns:
        title = learn.find('h3', class_='base-search-card__title').text.strip() if learn.find('h3', class_='base-search-card__title') else '-'
        subtitle = learn.find('h4', class_='base-search-card__subtitle').text.strip() if learn.find('h4', class_='base-search-card__subtitle') else '-'
        identifier = learn.find('p', class_='base-search-card__identifier').text.strip() if learn.find('p', class_='base-search-card__identifier') else '-'
        learn_link = learn.find('a')['href'] if learn.find('a') else '-'

        learn_info = {
            'title': title,
            'subtitle': subtitle,
            'identifier': identifier,
            'learn_link': learn_link,
        }
        learn_data.append(learn_info)
The remaining code will stay the same. This will generate a file containing a list of available courses:
We received a file containing 50 beginner-friendly learning materials, including courses and video tutorials, ranging from 0 to 10 minutes in length.
Scrape LinkedIn Articles
Another section that might need to be scraped quickly is LinkedIn Articles. A ready-made script can be found on Google Colaboratory.
Let’s move to the LinkedIn Articles page and take a closer look at it:
Unlike previous examples, there are no filters that can be customized here. However, you can choose any category or subcategory from the ones on the right. As an example, we will use a link to the root section:
ln_url = f"https://www.linkedin.com/pulse/topics/home/"
And change selectors:
article_list = soup.find('div', class_='content-hub-home-core-rail')
article_data = []

if article_list:
    articles = article_list.find_all('div', class_='content-hub-entity-card-redesign')
    for article in articles:
        title = article.find('h2').text.strip() if article.find('h2') else '-'
        description = article.find('p', class_='content-description').text.strip() if article.find('p', class_='content-description') else '-'
        contributions = article.find('span').text.strip() if article.find('span') else '-'
        timestamp = article.find_all('span')[-1].text.strip() if article.find('span') else '-'
        article_link = article.find('a')['href'] if article.find('a') else '-'

        article_info = {
            'title': title,
            'description': description,
            'contributions': contributions,
            'timestamp': timestamp,
            'article_link': article_link,
        }
        article_data.append(article_info)
Other parts of the script will remain the same as in the previous examples, producing the following CSV file:
We obtained 100 article links related to the specified topic in total. If necessary, the script can be refined to gather more detailed data by crawling through all links in a queue. This will allow us to retrieve article content, author information, links to their profiles, and discussion participants.
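As a starting point for that refinement, here is a minimal sketch of a link queue; it only fetches and stores the raw HTML of each article via the same API, since the selectors for article content depend on the page layout and should be checked in DevTools first:
from collections import deque

queue = deque(article['article_link'] for article in article_data if article['article_link'] != '-')
pages = {}

while queue:
    link = queue.popleft()
    page_payload = json.dumps({"url": link, "proxyType": "datacenter", "jsRendering": True})
    page_response = requests.post(url, headers=headers, data=page_payload)
    pages[link] = page_response.json()['content']  # raw HTML, to be parsed later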
Conclusion
In this article, we explored various methods for accessing data on LinkedIn and examined different approaches to implementing scrapers. Additionally, we provided ready-to-use code examples and uploaded them to Google Colaboratory for convenient access and cloud-based execution.
As a result, we developed several handy tools that make it easy to extract the data you need from LinkedIn, whether it comes from a course page or a job listing page. To simplify data collection in our examples, we utilized HasData’s API, which enables data gathering without the risk of getting blocked by LinkedIn and without the need to log in to your profile.