Scraping Indeed job postings

So, what is the easiest way to get all the job posting details from Indeed.com?

Option 1: Subscribe to Specrom’s job posting data feed

We have a job feed that extracts all the pertinent job posting information, such as company name, city, snippet, and job title, when you specify a search query and a location.

The API is available on RapidAPI and there is a free trial with no credit card information required. Paid plans start at just $10.

Option 2: Full service web scraping service.

If you just need job postings data as a CSV or Excel file, contact us for our full service web scraping service. You can sit back and let us handle all the backend issues to get the data you need.

Option 3: Scrape indeed.com on your own

Python is great for web scraping, and we will use a library called Selenium to extract job postings from Indeed for Atlanta, GA.

  • Fetching the raw HTML page from Indeed

  • We will use Selenium to automate entering the search query and location into the text boxes and clicking the search button

### Using Selenium to extract Indeed.com's raw html source
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

import time

from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

test_url = 'https://www.indeed.com/'

option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# path to the ChromeDriver executable; Selenium 4 takes it through a Service object
chromedriver = r'chromedriver.exe'
browser = webdriver.Chrome(service=Service(chromedriver), options=option)
browser.get(test_url)

# type the search query into the "what" box
text_area = browser.find_element(By.ID, 'text-input-what')
text_area.send_keys("Web scraping")

# type the location into the "where" box
text_area2 = browser.find_element(By.ID, 'text-input-where')
text_area2.send_keys("Atlanta, GA")

# click the search button and give the results page a few seconds to load
element = browser.find_element(By.XPATH, '//*[@id="whatWhereFormId"]/div[3]/button')
element.click()
time.sleep(5)

html_source = browser.page_source
browser.quit()

Using BeautifulSoup to extract Indeed job postings

Once we have the raw HTML source, we can parse it with a Python library called BeautifulSoup.

  • Open the page in the Chrome browser, right-click an element, and click Inspect.

Figure 1: Inspecting the HTML source code of the Indeed.com search results page.

Extracting job titles

From inspecting the HTML source, we see that job titles are in h2 tags with the class ‘title’.

# extracting job titles
soup = BeautifulSoup(html_source, "html.parser")

# match h2 tags whose class is 'title'
job_title_src = soup.find_all('h2', {'class': 'title'})
job_title_list = []

for val in job_title_src:
    job_title_list.append(val.get_text())
job_title_list
# Output
['Data Analyst',
 'Data Analyst',
 'Data/Reporting Analyst',
 'Sr. Data Analyst',
 'Data Analyst',
 'Data Analyst',
 'Behavior Data Analyst - Marcus Autism Center - Behavioral Analysis Core',
 'Data Analyst',
 'Data Analyst',
 'Police Analyst',
 'Data Analyst (2021-1614)',
 'Employee Data Analyst',
 'Data Analyst',
 'Data and Research Analyst',
 'Data Analyst (13255)']

Extracting company names

The next step is extracting company names. We see that each one is in a span tag with the class ‘company’.

# extracting company names

company_name_src = soup.find_all('span', {'class': 'company'})

company_name_list = []

for val in company_name_src:
    company_name_list.append(val.get_text())
company_name_list
# Output
['KIPP Foundation',
 'Emory University',
 'City of Atlanta, GA',
 'The Coca-Cola Company',
 'KIPP Metro Atlanta Schools',
 'Spartan Technologies',
 "Children's Healthcare of Atlanta",
 'ARK Solutions',
 'Anthem',
 'City of Forest Park, GA',
 'Atrium CWS',
 'Salesforce',
 'Sovos Compliance',
 'Southern Poverty Law Center',
 'Baer Group']

Extracting snippets

Snippets are a couple of sentences of text that briefly describe the job posting. Along with the job title and company name, the snippet is one of the most important pieces of information to extract from each Indeed job posting result.

As a reference, refer to the figure below.

Figure 2: Individual job postings from Indeed.com search results.

  • We will extract the snippet for each job posting. For brevity, we only show results from the first three job postings; you can verify that the first result matches the text in Figure 2 above.

# extracting snippets from each job posting

snippet_src = soup.find_all('div', {'class': 'summary'})
snippet_list = []
for val in snippet_src:
    snippet_list.append(val.get_text())

snippet_list[:3]
# Output
['\n\nMaintain and troubleshoot the integrity of data linkages between data source systems and data warehouse.\nCollaborate with other Data Team members to develop and…\n',
 '\n\nCreates and maintains a data dictionary and meta data.\nAnalyzing data reporting data for clinical outcomes, qualitative and other types of research.\n',
 '\n\n3 years of work experience in creation, reporting, and/or management of data or closely related tasks (not including data entry).\n']

Converting into CSV file

You can take the lists above and load them into a pandas DataFrame. Once you have the DataFrame, you can easily export it to CSV, Excel, or JSON.
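As a rough illustration, the conversion could look something like the sketch below. It uses only the first three values from each output shown earlier as sample data, and it assumes that the three lists line up one-to-one (the i-th title, company, and snippet belong to the same posting). The column names and the file name `indeed_jobs.csv` are our own placeholders, not anything Indeed defines.

```python
import pandas as pd

# Sample values copied from the outputs above; in practice use the full lists.
job_title_list = ['Data Analyst', 'Data Analyst', 'Data/Reporting Analyst']
company_name_list = ['KIPP Foundation', 'Emory University', 'City of Atlanta, GA']
snippet_list = [
    '\n\nMaintain and troubleshoot the integrity of data linkages between data source systems and data warehouse.\nCollaborate with other Data Team members to develop and…\n',
    '\n\nCreates and maintains a data dictionary and meta data.\nAnalyzing data reporting data for clinical outcomes, qualitative and other types of research.\n',
    '\n\n3 years of work experience in creation, reporting, and/or management of data or closely related tasks (not including data entry).\n',
]

# pandas raises a ValueError if the lists differ in length, so check first.
if not (len(job_title_list) == len(company_name_list) == len(snippet_list)):
    raise ValueError("The three lists must have the same number of entries")

df = pd.DataFrame({
    'job_title': job_title_list,      # hypothetical column names
    'company': company_name_list,
    # collapse the newlines and extra whitespace left over from the HTML
    'snippet': [' '.join(s.split()) for s in snippet_list],
})

# index=False keeps the DataFrame's row numbers out of the file
df.to_csv('indeed_jobs.csv', index=False)
```

If a posting is missing one of the fields, the lists will not line up, so it is worth checking the lengths before trusting the rows. `df.to_excel(...)` and `df.to_json(...)` work the same way for the other formats.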

Scaling up to a full crawler for extracting all Indeed job postings

  • Once you scale up to thousands of requests to fetch all the pages, the indeed.com servers will start blocking your IP address outright, or you will be flagged and start getting CAPTCHAs.

  • To make it more likely that you can successfully fetch data for the entire USA, you will have to:

    • rotate proxy IP addresses, preferably using residential proxies
    • rotate user agents
    • use an external CAPTCHA-solving service like 2captcha or anticaptcha.com
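To give a feel for the first two items, here is a minimal sketch of one way to rotate proxies and user agents between browser sessions. The proxy addresses and user agent strings are placeholders we made up for illustration, and the helper name `next_chrome_arguments` is hypothetical. It relies on Chrome's `--proxy-server` and `--user-agent` command-line switches, and it does not cover CAPTCHA solving.

```python
import itertools
import random

# Placeholder values: substitute the proxies and user agents you actually have.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

# Round-robin over the proxies so consecutive sessions use different IPs.
proxy_pool = itertools.cycle(PROXIES)


def next_chrome_arguments():
    """Return the Chrome command-line arguments for the next browser session."""
    proxy = next(proxy_pool)
    user_agent = random.choice(USER_AGENTS)  # pick a user agent at random
    return [f"--proxy-server={proxy}", f"--user-agent={user_agent}"]


# Show what three consecutive sessions would be started with.
for _ in range(3):
    print(next_chrome_arguments())
```

Each returned argument can be passed to `option.add_argument(...)` before the browser is created, so every new Chrome session goes out through a different proxy. Proxies that require a username and password usually need extra handling beyond the `--proxy-server` switch.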

After you follow all the steps above, you will realize that our pricing for managed web scraping and our Indeed scraper API is among the most competitive in the market.