Web scraping out in the wild with Python

Last updated by @craigaddyman on 20 August 2015

So I've done a little web scraping, mainly for myself and a little for work. A few nights ago my wife was telling me about a task she had to do for one of her colleagues: fetching data from a website full of contractors' information from within her industry. Intrigued, I took a look and thought I could probably automate this :)

I'm not going to reveal the website but here's what went down.

Having a look around the site, I realised all the links were generated by JavaScript. This was the first hurdle, but luckily it wasn't actually going to be a problem. I opened up a contractor page and the URL structure was like so...


I started playing with the URL to see how it worked and quickly realised that if I removed the contractor name, the page would still resolve. I now had the same page, but the URL looked like this...


Next I changed the numbers from 3547 (or whatever it was) to just the number 1. I pressed enter and... nothing: the page resolved, but it was a blank template. I tried 2 and bingo, there was the first contractor's information. A few more tests made it clear this number was an ID assigned in the order each contractor was added to the database.

This meant I could bypass the JavaScript issue entirely and spider each page directly.

The next step was even easier: the name of the contractor was in an h3 heading tag (the only h3 tag on each page), and the contractor's email was in a link with a 'mailto:' href, so the only two bits I needed were easy to target.
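That extraction step can be sketched with just the standard library's html.parser; the page fragment and names below are made up, but the structure matches what I've described (one h3 for the name, one mailto: link for the email):

```python
from html.parser import HTMLParser

class ContractorParser(HTMLParser):
    """Pulls the lone <h3> (contractor name) and the mailto: link (email)."""
    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.name = None
        self.email = None

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_h3 = True
        elif tag == 'a':
            href = dict(attrs).get('href', '')
            if href.startswith('mailto:'):
                # Strip the scheme to leave the bare address
                self.email = href[len('mailto:'):]

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.name = data.strip()

# Hypothetical page fragment in the shape described above
page = '<h3>Jane Doe Ltd</h3><a href="mailto:jane@example.com">Email</a>'
p = ContractorParser()
p.feed(page)
print(p.name, p.email)  # Jane Doe Ltd jane@example.com
```

In practice I used BeautifulSoup (see the script below), which does the same thing with far less ceremony.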

Within 30 minutes or so I had this running in the cloud doing all the hard work...

from bs4 import BeautifulSoup
import requests
import time
import csv

dic = {}

site = 'http://www.domain.com/contractor-search/'

for i in range(0, 5000):
    r = requests.get(site + str(i))
    soup = BeautifulSoup(r.text, 'html.parser')
    email = None
    # The email lives in the only link with a mailto: href
    for link in soup.find_all('a', href=True):
        if 'mailto' in link['href']:
            email = link['href']
    # The contractor's name is in the only h3 on the page
    for h3 in soup.find_all('h3'):
        if email:
            dic[email] = h3.text
    time.sleep(1)  # a small delay between requests keeps things polite

print(dic)

with open('output_v3.csv', 'w', newline='') as f:
    w = csv.writer(f)
    for email, name in dic.items():
        w.writerow([name, email])

As you can see, I'm using range to iterate 5000 times, and the loop counter doubles as the ID number in the URL, so the URL changes like this on each iteration...
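Concretely, each iteration's target is just the base URL with the counter appended (the domain here is a placeholder, as in the script above):

```python
site = 'http://www.domain.com/contractor-search/'

# Each iteration's target URL is the base plus the loop counter
urls = [site + str(i) for i in range(0, 5000)]

print(urls[0])  # http://www.domain.com/contractor-search/0
print(urls[1])  # http://www.domain.com/contractor-search/1
```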




and so on...


