[DTW] Week11: Final Project

  • Final Project Idea: Scraping data from Glassdoor and visualizing it
  • Based on the job data scraped from Glassdoor, the project shows the number of job opportunities and the most popular jobs in each state. It also shows the number of opportunities for the jobs that ITP students prefer.
  • Scraped Data
    • Total number of job opportunities in each state
    • Top jobs in each state and the number of job opportunities for each job
    • The number of opportunities for ‘Multimedia Designer’, ‘Software Developer’, and ‘UX Designer’
  • Implementation
    • Python, Selenium: scraping data from glassdoor.com and generating JSON files
    • HTML5/CSS/JavaScript: data visualization
    • Node.js: running the server

The video below is a screen recording I made while Selenium was scraping the data:

As you can see at the end of the video, my program was caught by a CAPTCHA. It may have been because I got lazy and didn’t add enough wait time between requests, or because I ran it for too long (about 30 minutes).
The CAPTCHA released me a few hours later, but even after I added more time.sleep() calls, I couldn’t run the for loop in one go.
So I split it into five smaller loops, which gave me five JSON files that I manually joined together at the end.
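Merging the pieces was straightforward, since each file holds a JSON array of state records. Here is a minimal sketch of the merge, assuming the partial files were named intheus01.json through intheus05.json (the exact names are an assumption; intheus.py below dumps one of its runs to intheus05.json):

import json

# assumed names for the five partial runs: intheus01.json ... intheus05.json
parts = ["intheus0%d.json" % i for i in range(1, 6)]

merged = []
for name in parts:
  with open(name) as infile:
    merged.extend(json.load(infile))  # each file holds a JSON array of state records

with open("intheus.json", "w") as outfile:
  json.dump(merged, outfile, indent=2)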

Following is the source code that I used for scraping the data.
intheus.py was for scraping the overall job market data in the US, and foritper.py was for scraping the job information that ITP students potentially have an interest in:

[intheus.py]

from selenium import webdriver
import json
import time


def read_states():
  with open("states.txt") as f:
    content = f.readlines()
  content = [x.strip() for x in content]

  return content

def get_locid():
  
  states = read_states()

  locIds = []

  # this only needs to be done once, at the beginning
  temp = driver.find_element_by_id('LocationSearch')
  temp.send_keys("Alabama")
  tempBtn = driver.find_element_by_id('HeroSearchButton')
  tempBtn.click()

  for s in states:
    text = driver.find_element_by_id('sc.location')
    text.clear()
    text.send_keys(s)
    button = driver.find_element_by_id('HeroSearchButton')
    button.click()
    time.sleep(1)
    i = []
    i = driver.find_elements_by_class_name('locId')
    locIds.append(i[0].get_attribute("value"))
  
  return locIds	

def get_jobs_and_links():

  states = read_states()

  jobsNlinks = []

  # this only needs to be done once, at the beginning
  temp = driver.find_element_by_id('LocationSearch')
  temp.send_keys("New York")
  tempBtn = driver.find_element_by_id('HeroSearchButton')
  tempBtn.click()

  out = []
  # the full loop over all states kept triggering CAPTCHAs,
  # so it was split into five runs of ten states each
  # for s in states:
  for x in range(40, 50):
    s = states[x]
    text = driver.find_element_by_id('sc.location')
    text.clear()
    text.send_keys(s)
    button = driver.find_element_by_id('HeroSearchButton')
    button.click()
    time.sleep(1)
    aas = driver.find_elements_by_class_name('links-group')[0].find_elements_by_class_name('links')[0].find_elements_by_tag_name('a')
    item = { 
      "state": s, 
      "topjobs": []
    }

    for a in aas:
      print(a.text)
      print(a.get_attribute("href"))
      lj = {
        "job": a.text,
        "link": a.get_attribute("href"),
        "count" : ""
      }
      item["topjobs"].append(lj)
    out.append(item)

  return out

def get_jobcounts():
  # fills in the "count" field for each top job in the global `data`
  # returned by get_jobs_and_links()
  for d in data:
    for tj in d["topjobs"]:
      url = tj["link"]
      driver.get(url)
      time.sleep(2.1)  # wait for the results page to load before reading the count
      c = driver.find_elements_by_class_name('jobsCount')[0].text
      subc = c.replace(' Jobs', '')
      subc = subc.replace(',', '')
      print(subc)
      tj["count"] = int(subc)


def get_alljobcounts():

  states = read_states()

  # this only needs to be done once, at the beginning
  temp = driver.find_element_by_id('LocationSearch')
  temp.send_keys("Alabama")
  tempBtn = driver.find_element_by_id('HeroSearchButton')
  tempBtn.click()

  out = []

  for s in states:
    time.sleep(1.4)
    text = driver.find_element_by_id('sc.location')
    text.clear()
    text.send_keys(s)
    button = driver.find_element_by_id('HeroSearchButton')
    time.sleep(1.6)
    button.click()
    time.sleep(2.1)
    c = driver.find_elements_by_class_name('jobsCount')[0].text
    subc = c.replace(' Jobs', '')
    subc = subc.replace(',', '')
    print(subc)
    item = {
      "state": s,  # keep the state name so each total stays identifiable
      "total": int(subc)
    }
    out.append(item)

  return out
    

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
driver = webdriver.Chrome(chrome_options=options)

driver.get('https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=&sc.keyword=&locT=&locId=&jobType=') 

# earlier runs, now commented out:

# ids = get_locid()

# data = get_jobs_and_links()

# get_jobcounts()

# with open("intheus05.json", "w") as infile:
#   json.dump(data, infile, indent=2)

data2 = get_alljobcounts()
with open("totaljobs.json", "w") as infile:
  json.dump(data2, infile, indent=2)

[foritper.py]

from selenium import webdriver
import json
import time


def read_states():
  with open("states.txt") as f:
    content = f.readlines()
  content = [x.strip() for x in content]

  return content


def get_numberPerState(job_title):

  states = read_states()

  # this only needs to be done once, at the beginning
  temp = driver.find_element_by_id('LocationSearch')
  temp.send_keys("New York")
  tempBtn = driver.find_element_by_id('HeroSearchButton')
  tempBtn.click()

  item = {}

  # for s in states:
  for x in range(0, 50):
    s = states[x]
    time.sleep(1.6)
    job = driver.find_element_by_id('sc.keyword')
    job.clear()
    job.send_keys(job_title)
    time.sleep(1.2)

    loc = driver.find_element_by_id('sc.location')
    loc.clear()
    loc.send_keys(s)
    time.sleep(1.8)

    button = driver.find_element_by_id('HeroSearchButton')
    button.click()
    time.sleep(1.5)
    c = driver.find_elements_by_class_name('jobsCount')[0].text
    # strip " Jobs"/" Job" and thousands separators, e.g. "1,234 Jobs" -> "1234"
    subc = c.replace(' Jobs', '')
    subc = subc.replace(' Job', '')
    subc = subc.replace(',', '')
    print(subc)

    item[s] = int(subc)

  out = {
    job_title : item
  }

  return out


options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
driver = webdriver.Chrome(chrome_options=options)

driver.get('https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=&sc.keyword=&locT=&locId=&jobType=') 


data = get_numberPerState("ar vr")

with open("arvr01.json", "w") as infile:
  json.dump(data, infile, indent=2)
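Since foritper.py scrapes one job title per run, covering all the ITP-relevant titles means running it several times. A batch run could look like the sketch below; it reuses get_numberPerState() from above, and the title list and output file names here are my assumptions:

# hypothetical batch run: one output file per job title
titles = ["Multimedia Designer", "Software Developer", "UX Designer", "ar vr"]

for i, title in enumerate(titles, start=1):
  result = get_numberPerState(title)
  filename = "itpjobs%02d.json" % i  # e.g. itpjobs01.json (assumed naming)
  with open(filename, "w") as outfile:
    json.dump(result, outfile, indent=2)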


Talking more about the JSON files: their initial format was like the one below:

[
  {
    "state": "Alabama", 
    "total": 41018,
    "topjobs": [
      {
        "count": 157, 
        "job": "Attorney", 
        "link": "https://www.glassdoor.com/Job/alabama-attorney-jobs-SRCH_IL.0,7_IS105_KO8,16.htm"
      }, ...
    ]
  }, ...
]

However, after I started working on it, I soon realized that this format isn’t very convenient for building my data visualization website.
Therefore, I made a JSON format converter which converts the format above into the format below (a sketch of such a converter follows the example):

{
  "Alabama": {
    "total": 41018,
    "topjobs": [
      {
        "count": 157,
        "job": "Attorney",
        "link": "https://www.glassdoor.com/Job/alabama-attorney-jobs-SRCH_IL.0,7_IS105_KO8,16.htm"
      }, ...
    ]
  }, ...
}
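Here is a minimal sketch of such a converter, assuming the input file holds the array format shown above (the file names are assumptions):

import json

# load the array-of-states format
with open("intheus_array.json") as infile:  # assumed input file name
  states = json.load(infile)

# re-key the list by state name
converted = {}
for entry in states:
  name = entry.pop("state")  # the state name becomes the key
  converted[name] = entry    # remaining fields: "total", "topjobs"

with open("intheus.json", "w") as outfile:
  json.dump(converted, outfile, indent=2)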

In the end, I used two JSON files: intheus.json and foritper.json.

Here is more detail about the data visualization part.

The actual website is here.

 
