- Final Project Idea: Scraping data from Glassdoor and visualizing it
- Based on the job data scraped from Glassdoor, the project shows the number of job opportunities and the most popular jobs in each state. It also shows the number of opportunities for the jobs that ITP students prefer.
- Scraped Data
- Total number of job opportunities in each state
- Top jobs in each state and the number of job opportunities for each job
- The number of opportunities for ‘Multimedia Designer’, ‘Software Developer’, and ‘UX Designer’
- Implementation
- Python, Selenium: Scraping data from glassdoor.com and generating JSON files
- HTML5/CSS/JavaScript: Data visualization
- Node.js: Running the server
The video below is a screen recording I made while Selenium was scraping the data:
As you can see at the end of the video, my program was caught by a CAPTCHA. It may have been because I got lazy and didn’t put in enough waiting time, or because I ran it for too long. (It was running for about 30 minutes.)
The CAPTCHA released me a few hours later, but even after I added more time.sleep() calls, I couldn’t run the for loop in one go.
So I split it into five smaller loops and, as a result, got five JSON files, which I manually joined together at the end.
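The manual join could also have been scripted. Here is a minimal sketch, assuming each part file holds a JSON list (the part-file names in the usage comment are hypothetical):

```python
import json

def merge_json_lists(paths, out_path):
    """Concatenate the JSON lists stored in `paths` and write
    the combined list to `out_path`."""
    merged = []
    for path in paths:
        with open(path) as f:
            merged.extend(json.load(f))
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)

# e.g. merge_json_lists(["part1.json", "part2.json"], "intheus.json")
```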
The following is the source code I used for scraping the data.
intheus.py was for scraping the overall US job-market data, and foritper.py was for scraping information about the jobs that ITP students are potentially interested in:
[intheus.py]
```python
from selenium import webdriver
import json
import subprocess
import time
import string


def read_states():
    with open("states.txt") as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    return content


def get_locid():
    states = read_states()
    locIds = []
    # this is needed to be done only the first time, at the beginning
    # temp = driver.find_elements_by_css_selector('#sc.location')
    temp = driver.find_element_by_id('LocationSearch')
    temp.send_keys("Alabama")
    tempBtn = driver.find_element_by_id('HeroSearchButton')
    tempBtn.click()
    for s in states:
        text = driver.find_element_by_id('sc.location')
        text.clear()
        text.send_keys(s)
        button = driver.find_element_by_id('HeroSearchButton')
        button.click()
        time.sleep(1)
        i = driver.find_elements_by_class_name('locId')
        locIds.append(i[0].get_attribute("value"))
    return locIds


def get_jobs_and_links():
    states = read_states()
    jobsNlinks = []
    # this is needed to be done only the first time, at the beginning
    temp = driver.find_element_by_id('LocationSearch')
    temp.send_keys("New York")
    tempBtn = driver.find_element_by_id('HeroSearchButton')
    tempBtn.click()
    out = []
    # for s in states:  # original loop; split into ranges after the CAPTCHA
    for x in range(40, 50):
        s = states[x]
        text = driver.find_element_by_id('sc.location')
        text.clear()
        text.send_keys(s)
        button = driver.find_element_by_id('HeroSearchButton')
        button.click()
        time.sleep(1)
        # the "top jobs" links for this state
        aas = driver.find_elements_by_class_name('links-group')[0] \
                    .find_elements_by_class_name('links')[0] \
                    .find_elements_by_tag_name('a')
        item = {
            "state": s,
            "topjobs": []
        }
        for a in aas:
            print(a.text)
            print(a.get_attribute("href"))
            lj = {
                "job": a.text,
                "link": a.get_attribute("href"),
                "count": ""
            }
            item["topjobs"].append(lj)
        out.append(item)
    return out


def get_jobcounts():
    for d in data:
        for tj in d["topjobs"]:
            url = tj["link"]
            driver.get(url)
            time.sleep(2.1)
            c = driver.find_elements_by_class_name('jobsCount')[0].text
            subc = c.replace(' Jobs', '')
            subc = subc.replace(',', '')
            print(subc)
            tj["count"] = int(subc)


def get_alljobcounts():
    states = read_states()
    # this is needed to be done only the first time, at the beginning
    # temp = driver.find_elements_by_css_selector('#sc.location')
    temp = driver.find_element_by_id('LocationSearch')
    temp.send_keys("Alabama")
    tempBtn = driver.find_element_by_id('HeroSearchButton')
    tempBtn.click()
    out = []
    for s in states:
        time.sleep(1.4)
        text = driver.find_element_by_id('sc.location')
        text.clear()
        text.send_keys(s)
        button = driver.find_element_by_id('HeroSearchButton')
        time.sleep(1.6)
        button.click()
        time.sleep(2.1)
        c = driver.find_elements_by_class_name('jobsCount')[0].text
        subc = c.replace(' Jobs', '')
        subc = subc.replace(',', '')
        print(subc)
        item = {
            "total": subc
        }
        out.append(item)
    return out


options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=&sc.keyword=&locT=&locId=&jobType=')

# ids = get_locid()

# data = get_jobs_and_links()
# get_jobcounts()
# with open("intheus05.json", "w") as infile:
#     json.dump(data, infile, indent=2)

data2 = get_alljobcounts()
with open("totaljobs.json", "w") as infile:
    json.dump(data2, infile, indent=2)
```
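In hindsight, the fixed time.sleep() intervals above make the request timing perfectly regular, which is easy for bot detection to spot. A small sketch of adding random jitter to each wait (an idea for next time, not something the scripts above actually did):

```python
import random
import time

def polite_sleep(base=1.5, jitter=1.0):
    """Sleep for `base` seconds plus a random extra delay of up to
    `jitter` seconds, so requests don't arrive at fixed intervals."""
    time.sleep(base + random.uniform(0, jitter))
```

Swapping each fixed time.sleep() call for polite_sleep() would vary the delays without slowing the scrape down much.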
[foritper.py]
```python
from selenium import webdriver
import json
import subprocess
import time
import string


def read_states():
    with open("states.txt") as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    return content


def get_numberPerState(job_title):
    states = read_states()
    jobsNlinks = []
    # this is needed to be done only the first time, at the beginning
    temp = driver.find_element_by_id('LocationSearch')
    temp.send_keys("New York")
    tempBtn = driver.find_element_by_id('HeroSearchButton')
    tempBtn.click()
    item = {}
    # for s in states:
    for x in range(0, 50):
        s = states[x]
        time.sleep(1.6)
        job = driver.find_element_by_id('sc.keyword')
        job.clear()
        job.send_keys(job_title)
        time.sleep(1.2)
        loc = driver.find_element_by_id('sc.location')
        loc.clear()
        loc.send_keys(s)
        time.sleep(1.8)
        button = driver.find_element_by_id('HeroSearchButton')
        button.click()
        time.sleep(1.5)
        c = driver.find_elements_by_class_name('jobsCount')[0].text
        # "1,234 Jobs" -> "1,234s" -> "1,234" -> "1234"
        subc = c.replace(' Job', '')
        subc = subc.replace('s', '')
        subc = subc.replace(',', '')
        print(subc)
        item[s] = int(subc)
    out = {job_title: item}
    return out


options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=&sc.keyword=&locT=&locId=&jobType=')

data = get_numberPerState("ar vr")
with open("arvr01.json", "w") as infile:
    json.dump(data, infile, indent=2)
```
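Both scripts clean up strings like “1,234 Jobs” with chained replace() calls. A regex does the same job more robustly; this is a sketch of an alternative, not what the scripts above use:

```python
import re

def parse_job_count(text):
    """Extract the leading number from strings like '1,234 Jobs'
    or '1 Job', ignoring thousands separators."""
    match = re.search(r'[\d,]+', text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group().replace(',', ''))
```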
To say a bit more about the JSON files, their initial format looked like this:
```json
[
  {
    "state": "Alabama",
    "total": 41018,
    "topjobs": [
      {
        "count": 157,
        "job": "Attorney",
        "link": "https://www.glassdoor.com/Job/alabama-attorney-jobs-SRCH_IL.0,7_IS105_KO8,16.htm"
      },
      ...
    ]
  },
  ...
]
```
However, after I started working with it, I soon realized that this format wasn’t very effective for building my data visualization website.
Therefore, I wrote a JSON format converter that turns the format above into the format below:
```json
{
  "Alabama": {
    "total": 41018,
    "topjobs": [
      {
        "count": 157,
        "job": "Attorney",
        "link": "https://www.glassdoor.com/Job/alabama-attorney-jobs-SRCH_IL.0,7_IS105_KO8,16.htm"
      },
      ...
    ]
  },
  ...
}
```
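The converter boils down to re-keying the list by state name. A minimal sketch of that reshaping step (the file reading and writing around it is omitted):

```python
def convert(states_list):
    """Re-key the scraped list of state records into a dict keyed
    by state name, which is easier to look up on the website."""
    by_state = {}
    for record in states_list:
        state = record.pop("state")  # remove the key, keep the rest
        by_state[state] = record
    return by_state
```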
In the end, I used two JSON files: intheus.json and foritper.json.
Here is more detail about the data visualization part.
The actual website is here.
REFERENCES
- How to manipulate String in Python
- How to slice a string by character
- How to click a button with Python & Selenium
- How to add text to a text field with Selenium
- How to clear text from a text field with Selenium
- What is the Selenium property for getAttribute?
- How to read a file line-by-line in Python
- for loop syntax in Python
- How to read and write a file in Python
- How to extract a string from between quotations
- How to add an element to a JSON list
- How to remove specific substrings from a set of strings in Python
- How to parse a string to a float or int in Python