Last Updated: 20-August-2015

I wrote this script as I was getting pissed off with Excel crashing when checking a measly 100,000 links; working with formulas, specifically VLOOKUP, can be a real nightmare!

It seems to be working fine and I haven't encountered any bugs, but I would like to speed it up further if I can. I have loaded close to a million URLs into it and it runs fine, but that takes about an hour and a half when running it straight off my desktop.

If anyone would like to give me suggestions on how to get it running faster, that would be great and, of course, credit will be given where credit is due! :)

Here's the code! (Anyone else who wants to use this, feel free.)


'''
The following simply checks a file of links (new.txt) against an existing file of links (existing.txt)
and outputs any unique links that are found (output.txt).
The existing.txt file is then updated ready to be checked against next time.
You should load your existing database of links (basically your first batch of links) into the new.txt file first, not the existing.txt
file, as the links go through a 'cleaning process'. The existing.txt file will then be populated with the 'cleaned' links.
'''
import codecs
import time
from tqdm import * # pip install tqdm

existing_links = [] # existing database of links
new_links = [] # new database of links

# Opens the existing file of links and checks / removes coding and adds each line to a list
with open('existing.txt') as existing:
    for url in existing.readlines():
        if url.startswith(codecs.BOM_UTF8):
            url = url[3:]
            existing_links.append(url)
        else:
            existing_links.append(url)

# The following opens the new file of links, strips away http://, https:// and www. and appends a forward slash to the end of each URL so that they match regardless of how they were originally written.

# Each line is then added to the new_links list. The stripping process is only needed on the new links because you should load the existing database as new links first. This keeps the code cleaner
# and helps it run faster.

with open('new.txt', 'r') as new_download:
    for url in new_download.readlines():
        if url.startswith(codecs.BOM_UTF8):
            url = url[3:]
            url = url.replace('http://', '').replace('https://', '').replace('www.', '')
            if url.endswith('/\n'):
                new_links.append(url)
            else:
                x = url.replace('\n', '/\n')
                new_links.append(x)
        else:
            url = url.replace('http://', '').replace('https://', '').replace('www.', '')
            if url.endswith('/\n'):
                new_links.append(url)
            else:
                x = url.replace('\n', '/\n')
                new_links.append(x)

# this finds the difference between the two lists aka the new links
x = set(new_links).difference(existing_links)

# this is just a fancy progress bar for the terminal so you know how long it's going to take (see the tqdm import at the top)
time_length = len(new_links)

for i in tqdm(range(time_length)):
    time.sleep(.01)

# this is the output file for the new links that were identified, with a date stamp in the file name
date = time.strftime("%d_%m_%Y")

f = open("output_" + date + ".txt", "w")
    f.writelines(x)
    f.close()

# this appends the newly found links to the existing database of links ready for next time.

append_existing = open('existing.txt', 'a')
append_existing.writelines(x)
append_existing.close()

print 'done'
raw_input() # stops the terminal from closing when running from windows
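
If you want to sanity-check the core idea before pointing the script at a big file, the whole thing boils down to a set difference between the cleaned new links and the existing ones. Here's a tiny sketch with made-up URLs (not real data from my files):

existing_links = ['example.com/\n', 'another-site.com/page/\n']
new_links = ['example.com/\n', 'brand-new-site.com/\n']

# anything in new_links that isn't already in existing_links
unique = set(new_links).difference(existing_links)
print list(unique)  # ['brand-new-site.com/\n']
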
Update

So I asked for some help on making this program run quicker. The code above can check around a million URLs in an hour and a half, but it turns out the code itself is actually fast: the bottleneck was the fancy terminal progress bar I added (the tqdm loop with time.sleep near the end), which sleeps for 0.01 seconds per URL, so with hundreds of thousands of URLs the sleeps alone account for virtually all of the running time. Simply removing that loop means a million URLs are now checked in around 7.5 seconds!
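
If you still want a progress bar, one option (not part of the script below, just a minimal sketch with made-up placeholder URLs) is to wrap the loop that does the real work in tqdm rather than running a separate sleep loop, so the bar tracks actual progress and adds almost no overhead:

from tqdm import tqdm  # pip install tqdm

# made-up placeholder URLs purely for illustration
raw_urls = ['http://www.example%d.com/page\n' % i for i in range(100000)]

new_links = []
for url in tqdm(raw_urls):  # the bar tracks the real iteration; no time.sleep() needed
    url = url.replace('http://', '').replace('https://', '').replace('www.', '')
    if not url.endswith('/\n'):
        url = url.replace('\n', '/\n')
    new_links.append(url)
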

The new code then would look like this...


import codecs
import time

existing_links = [] # existing database of links
new_links = [] # new database of links

with open('existing.txt') as existing:
    for url in existing.readlines():
        if url.startswith(codecs.BOM_UTF8):
            url = url[3:]
            existing_links.append(url)
        else:
            existing_links.append(url)

with open('new.txt', 'r') as new_download:
    for url in new_download.readlines():
        if url.startswith(codecs.BOM_UTF8):
            url = url[3:]
            url = url.replace('http://', '').replace('https://', '').replace('www.', '')
            if url.endswith('/\n'):
                new_links.append(url)
            else:
                x = url.replace('\n', '/\n')
                new_links.append(x)
        else:
            url = url.replace('http://', '').replace('https://', '').replace('www.', '')
            if url.endswith('/\n'):
                new_links.append(url)
            else:
                x = url.replace('\n', '/\n')
                new_links.append(x)

x = set(new_links).difference(existing_links)

date = time.strftime("%d_%m_%Y")

f = open("output_" + date + ".txt", "w")
f.writelines(x)
f.close()
append_existing = open('existing.txt', 'a')
append_existing.writelines(x)
append_existing.close()
print 'done'
raw_input()

You can see the discussion of this here.

Marc Poulin also offered another rewrite of the script to make it even quicker, seen here and below...


'''
The following simply checks a file of links (new.txt) against an existing file of links (existing.txt)
and outputs any unique links that are found (output.txt).
The existing.txt file is then updated ready to be checked against next time.
You should load your existing database of links (basically your first batch of links) into the new.txt file first, not the existing.txt
file, as the links go through a 'cleaning process'. The existing.txt file will then be populated with the 'cleaned' links.
'''
import codecs
import time
from urlparse import urlparse

def strip_bom(lines):
    ''' Return a list of strings with the leading BOM_UTF8 removed. '''
    return [line if not line.startswith(codecs.BOM_UTF8) else line[3:] for line in lines]

def load_existing():
    ''' Return a list of existing URLs with leading BOM_UTF8 removed. '''
    with open('existing.txt') as f:
        return strip_bom(f.readlines())

def load_new():
    ''' Return a list of new URLs that have been normalized. '''
    with open('new.txt', 'r') as f:
        lines = strip_bom(f.readlines())
        lines = [line.rstrip('\n') for line in lines]
        urls = [urlparse(line).netloc + '/\n' for line in lines]
        urls = [url if not url.startswith('www.') else url[4:] for url in urls]
        return urls

def log_new_links(links):
    ''' Write unique new links to a file. '''
    date = time.strftime("%d_%m_%Y")
    with open("output_" + date + ".txt", "w") as f:
        f.writelines(links)

def append_to_existing(lines):
    ''' Append unique new links to existing links. '''
    # this appends the newly found links to the existing database of links ready for next time.
    with open('existing.txt', 'a') as f:
        f.writelines(lines)

def main():
    new, existing = load_new(), load_existing()
    links = set(new).difference(existing)
    log_new_links(links)
    append_to_existing(links)

main()
print 'done'
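
To see what that urlparse-based normalisation does to a single line, here's a quick sketch (the URL is made up; note that netloc keeps only the domain, so any path is dropped):

from urlparse import urlparse  # Python 2; in Python 3 this lives in urllib.parse

line = 'http://www.example.com/some-page/'
url = urlparse(line).netloc + '/\n'  # 'www.example.com/\n' -- only the domain is kept
if url.startswith('www.'):
    url = url[4:]                    # 'example.com/\n'
print url
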

Thanks to everyone else who gave me suggestions too :)

About the author


Craig Addyman @craigaddyman
Head of Digital Marketing. Python Coder.