[DTW] Week02


The assignment this week:

Scrape the web and then create a list with the material you collected.


I chose urbandictionary.com for this assignment.
This site provides multiple pages and each page contains a list of words and expressions.
A specific page is selected with the query parameter ‘page’ (for example, ‘?page=2’).
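
To make the role of ‘page’ concrete, here is a minimal sketch of how the page URL is put together. The helper name page_url is made up for this illustration, and it only builds the string; it does not contact the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.urbandictionary.com/"

def page_url(page):
    # Hypothetical helper: append the 'page' query parameter to the base URL.
    return BASE_URL + "?" + urlencode({"page": page})

print(page_url(2))
```

Passing params={'page': 2} to requests.get produces the same kind of URL, so the query string never has to be written by hand.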

I used the following Python code to scrape the words:

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with every request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def get_page(s):
  # Fetch listing page number s and return the words found on it.
  url = "https://www.urbandictionary.com/"
  response = requests.get(url, params={'page': s}, headers=headers)
  html = response.text
  soup = BeautifulSoup(html, "html.parser")
  # Each word is a link with the CSS class "word".
  titles = soup.select('a.word')

  output = []
  for title in titles:
    clean_title = title.text.strip()
    output.append(clean_title)

  return output

# Go through all 655 pages (assuming they are numbered from 1)
# and print one word per line.
start = 1
while start < 656:
  results = get_page(start)
  for r in results:
    print(r)

  start = start + 1
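
To show what the ‘a.word’ selection picks out without touching the network, here is a small sketch that does the same kind of extraction on a made-up HTML snippet. It uses Python's built-in html.parser instead of BeautifulSoup, and the names WordLinkParser, extract_words and the sample markup are all assumptions for illustration; the real page layout may differ.

```python
from html.parser import HTMLParser

class WordLinkParser(HTMLParser):
    """Collect the text of every <a> tag whose class list includes 'word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or empty, so fall back to "".
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "word" in classes:
            self._in_word_link = True
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a matching link.
        if self._in_word_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_word_link:
            self.words.append("".join(self._buffer).strip())
            self._in_word_link = False

def extract_words(html):
    parser = WordLinkParser()
    parser.feed(html)
    parser.close()
    return parser.words

# Made-up sample markup, not copied from the real site.
sample = (
    '<div>'
    '<a class="word" href="/define.php?term=yeet"> yeet </a>'
    '<a href="/about">About</a>'
    '<a class="word big" href="#">on fleek</a>'
    '</div>'
)
print(extract_words(sample))
```

Only the two links carrying the class "word" are kept, with the surrounding whitespace stripped; the plain "About" link is ignored.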

The number of pages was 655, and I collected 4584 words in total.
I saved all the results in a text file using the command ‘python ud.py > ud.txt’.
I did it this way only because I didn’t know a better way to make a list.
The text file is available here
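
As one possible alternative to redirecting the output in the shell, the list can be written to a file from inside Python. This is only a sketch: the function names save_words and load_words are made up here, and the file name is whatever you pass in.

```python
def save_words(words, path):
    # Write one word per line, encoded as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")
    return len(words)

def load_words(path):
    # Read the file back into a Python list, one entry per line.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

For example, collecting the results of every get_page call into one list and then calling save_words(all_words, "ud.txt") would create the same kind of text file, and load_words("ud.txt") would turn it back into a list.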


