So there was this web page, nothing more than a text file in fact, which I was compelled to check regularly. I had ordered something online from a company whose web-store design was still in the dark ages, and their “method” of letting customers know whether or not their products had shipped was to periodically update a web-viewable text file with a public list of all the order numbers that had recently shipped. Very private and secure, I know.
No, no, they couldn’t possibly email this information to you — you had to check the page manually. And if your products were backordered, as mine were, you’d be checking often.
Well, that’s just stupid, I thought, and since I had a computer sitting right in front of me, it seemed like a good idea to automate the process.
So I cobbled together a little Python script on my Ubuntu box. It uses the PycURL interface to grab the URL in question, scans the page line by line for my order number, and then fires me off an email (using sendmail) if and when it finds anything.
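For anyone who just wants the gist of the fetch-and-search step, here is a minimal sketch in Python 3 using only the standard library. It is not the script I actually ran: urllib.request stands in for PycURL, and the names fetch_page and find_matches are made up for illustration.

```python
import urllib.request


def fetch_page(url, timeout=40):
    """Download a page and return its body as text (hypothetical helper)."""
    # Assumption: the page is UTF-8 or close enough; bad bytes are replaced.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def find_matches(page_text, query):
    """Return every line of the page that contains the query string."""
    return [line.strip() for line in page_text.splitlines() if query in line]
```

Calling find_matches(fetch_page(url), "my order number") gives back a list of matching lines, and an empty list means there is nothing to report yet.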
I automated the script to run hourly using cron, like so:
crontab -e
and then included the following line in my crontab file:
# m h dom mon dow command
7 * * * * /path/to/searchagent.py
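Seven minutes after the hour (for luck), every hour, every day. Since cron runs the file directly, it has to be executable first (chmod +x /path/to/searchagent.py).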
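If the five cron fields look cryptic, this small Python 3 sketch spells out what "7 * * * *" means by computing the next time such a job would fire. The function next_run is a hypothetical helper for illustration only; cron itself does not work this way internally, and the sketch handles only the "fixed minute, every hour" case.

```python
from datetime import datetime, timedelta


def next_run(now, minute=7):
    """Return the next time strictly after `now` that falls on the given minute."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's slot has already passed, so wait for the next hour.
        candidate += timedelta(hours=1)
    return candidate
```

So a check at 10:30 waits until 11:07, while a check at 10:05 only waits two minutes.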
On the hunch that this script might be useful to somebody else, I decided to post it here. Fair warning to all: this is a huge hack, and I’m sure it needs work. Also, there’s probably a better method that I’ve completely overlooked. Comments, corrections, and suggestions are welcome, as always.
Here’s the script:
#!/usr/bin/env python
#############################
# Module: searchagent.py
# Author: Brian D. Wendt
# Date: 2008/08/15
# Version: Draft 0.3
'''
Searches for a specified string in a supplied webpage, then sends an email when found.
Requires Linux 'sendmail' and 'pycurl' module.
On Ubuntu, try 'sudo apt-get install sendmail python-pycurl'
'''
################################
import os
import pycurl

#####################################################
### Edit these variables! ###########################
#####################################################
# webpage to search
url = 'http://reddit.com/'
# string to find in the webpage
query_string = "LOLcats"
# email message particulars
to_name = "Harry S. Noob"
to_email = "[email protected]"
from_name = "Python Search Agent"
from_email = "[email protected]"
subject = "We've Found Something!"
body = "Your search query %s was found in a scan of %s." % (query_string, url)

#####################################################
### The mail-sending function (requires sendmail) ###
#####################################################
def sendmail(to_name, to_email, from_name, from_email, subject, body):
    # full path to sendmail
    mailerdaemon = "/usr/sbin/sendmail"
    # logfile when email sent (so you only send once!)
    email_log = "searchagent.log"
    # format the message
    mail = "To: \"%s\" <%s>\nFrom: \"%s\" <%s>\nSubject: %s\n\n%s" % (
        to_name, to_email, from_name, from_email, subject, body)
    # check to see if the email logfile exists
    try:
        logfile = open(email_log, 'r')
        email_sent = logfile.readline()
        logfile.close()
    except IOError:
        email_sent = ""
    # if no mail's been sent, send one
    if email_sent == "":
        # open a pipe to sendmail and write message to the pipe
        print("Sending email...")
        p = os.popen("%s -t" % mailerdaemon, 'w')
        p.write(mail)
        exitcode = p.close()
        if exitcode:
            print("Oops! There was an error: %s" % exitcode)
        else:
            print("Mail sent!")
            # log the sending, so you don't send again
            log = open(email_log, 'w')
            log.write("MAIL SENT")
            log.close()
    else:
        print("Email already sent! Exiting...")

#####################################################
### Read the webpage using pyCURL ###################
#####################################################
# some CURL options (pose as a Mozilla browser)
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0)'
headers = ['Cache-control: max-age=0', 'Pragma: no-cache', 'Connection: Keep-Alive']
# create temporary textfile for results (binary mode, since curl writes raw bytes)
tempfile = "TEMP-HTML.txt"
outputfile = open(tempfile, "wb")
# keep track of the result count and results list
results_count = 0
results_list = []
# set up CURL object
c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HEADER, 1)
# actually send the extra request headers defined above
c.setopt(pycurl.HTTPHEADER, headers)
c.setopt(pycurl.USERAGENT, USER_AGENT)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.CONNECTTIMEOUT, 40)
c.setopt(pycurl.TIMEOUT, 300)
c.setopt(pycurl.FILE, outputfile)
c.perform()
c.close()
outputfile.close()

# now search the file line by line
htmlfile = open(tempfile, "r")
for line in htmlfile.readlines():
    search = line.find(query_string)
    if search != -1:
        # store results in a list, if desired
        # results_list.append(line.strip())
        results_count += 1
htmlfile.close()

# if there's any result, send an email
if results_count > 0:
    print("Found a match.")
    sendmail(to_name, to_email, from_name, from_email, subject, body)
else:
    print("No matches found.")

# delete temporary textfile
os.remove(tempfile)
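The "only send once" trick is the part most worth reusing, so here is a hedged Python 3 sketch of it on its own. It builds the message with the standard email package instead of formatting the headers by hand, and hands it to sendmail through subprocess. The function names build_message and send_once, the marker-file approach as written here, and the example.com addresses are all assumptions for illustration, not what the script above does line for line.

```python
import os
import subprocess
from email.message import EmailMessage
from email.utils import formataddr


def build_message(to_name, to_email, from_name, from_email, subject, body):
    """Assemble a plain-text email with properly quoted headers."""
    msg = EmailMessage()
    msg["To"] = formataddr((to_name, to_email))
    msg["From"] = formataddr((from_name, from_email))
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_once(msg, marker_path, sendmail_path="/usr/sbin/sendmail"):
    """Send msg via sendmail unless the marker file says we already did.

    Returns True if a message was sent, False if it was skipped.
    """
    if os.path.exists(marker_path):
        return False
    # "-t" tells sendmail to read the recipients from the message headers.
    subprocess.run([sendmail_path, "-t"], input=msg.as_bytes(), check=True)
    # Only record success after sendmail exits cleanly.
    with open(marker_path, "w") as marker:
        marker.write("MAIL SENT")
    return True
```

Deleting the marker file re-arms the alert, which is handy once the order you were waiting on has arrived and you want to watch for the next one.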
