Watching The Web
Ever wondered if you could be emailed automatically whenever your
favorite Web pages changed? Our intrepid developer didn't just wonder -
he sat down and wrote some code to make it happen. Here's his story.
So there I was, minding my own business, working on a piece of code I had to
deliver that evening, when the pretty dark-haired girl who sits in the cubicle
behind me popped her head over and asked for my help.
"Look", she said,
"I need your help with something. Can you write me a little piece of code that
keeps track of Web site URLs and tells me when they change?"
"Huh?", was
my first reaction...
"It's like this", she explained. "As part of a
content update contract, I'm in charge of tracking changes to about thirty
different Web sites for a customer, and sending out a bulletin with those
changes. Every day, I spend the morning visiting each site and checking to see
if it's changed. It's very tedious, and it really screws up my day. Do you think
you can write something to automate it for me?"
Now, she's a pretty
girl...and the problem intrigued me. So I agreed.
A Little Research
The problem, of course, appeared when I actually started work on her
request. I had a vague idea how this might work: all I had to do, I reasoned,
was write a little script that woke up each morning, scanned her list of URLs,
downloaded the contents of each, compared those contents with the versions
downloaded previously, and sent out an email alert if there was a change.
Seemed simple - but how hard would it be to implement? I didn't really
like the thought of downloading and saving different versions of each page on a
daily basis, or of creating a comparison algorithm to test Web pages against
each other.
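For illustration, the download-and-compare idea described above could be sketched roughly like this in Python. This is only a sketch under stated assumptions, not code from the story: the function name find_changed, the fetch callable and the saved dictionary are all hypothetical, and the scheduling and email steps are left out.

```python
from typing import Callable, Dict, List


def find_changed(urls: List[str],
                 fetch: Callable[[str], bytes],
                 saved: Dict[str, bytes]) -> List[str]:
    """Return the URLs whose content differs from the previously saved copy.

    `fetch` is any callable that downloads a URL and returns its body.
    `saved` maps URL -> body from the previous run and is updated in place.
    A URL seen for the first time is stored but not reported as changed.
    """
    changed = []
    for url in urls:
        body = fetch(url)              # download the current version
        old = saved.get(url)           # version kept from the last run, if any
        if old is not None and old != body:
            changed.append(url)        # byte-for-byte comparison
        saved[url] = body              # keep the new version for next time
    return changed
```

Even this toy version has to keep a full copy of every page between runs, which is exactly the overhead that made the approach unattractive.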
I thought there ought to be an easier way. Maybe the Web
server had a way of telling me if a Web page had been modified recently - and
all I had to do was read that data and use it in a script. Accordingly, my first
step was to hit the W3C Web site, download a copy of the HTTP/1.1 protocol
specification (RFC 2616) from ftp://ftp.isi.edu/in-notes/rfc2616.txt, and print it out for
a little bedside reading. Here's what I found, halfway through:
The Last-Modified entity-header field indicates
the date and time at which the origin server believes the variant was last
modified.
There we go, I thought - the guys who came up with
the protocol obviously anticipated this requirement and built it into the
protocol headers. Now to see if it worked...
The next day at work, I
fired up my trusty telnet client and tried to connect to our intranet Web server
and request a page. Here's the session dump:
$ telnet darkstar 80
Trying 192.168.0.10...
Connected to darkstar.melonfire.com.
Escape character is '^]'.
HEAD / HTTP/1.0

HTTP/1.1 200 OK
Date: Fri, 18 Oct 2002 08:47:57 GMT
Server: Apache/1.3.26 (Unix) PHP/4.2.2
Last-Modified: Wed, 09 Oct 2002 11:27:23 GMT
Accept-Ranges: bytes
Content-Length: 1446
Connection: close
Content-Type: text/html

Connection closed by foreign host.
As you can see, the Web
server returned a "Last-Modified" header indicating the date of last change of
the requested file. So far so good.
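The same check can be made without typing HTTP by hand. The following is a minimal Python sketch, not the code from this story: it sends a HEAD request using the standard library's http.client module and converts the "Last-Modified" header into a datetime. The function names parse_last_modified and fetch_last_modified are made up for this example, and the sketch assumes a plain HTTP server on port 80.

```python
import http.client
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_last_modified(value: Optional[str]) -> Optional[datetime]:
    """Convert an HTTP date such as 'Wed, 09 Oct 2002 11:27:23 GMT'
    into a datetime, or return None if the header is missing or unparseable."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def fetch_last_modified(host: str, path: str = "/",
                        port: int = 80) -> Optional[datetime]:
    """Send a HEAD request and return the server's Last-Modified time.

    Returns None when the server does not send the header, which is
    common for dynamically generated pages.
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path)     # headers only, no page body
        response = conn.getresponse()
        return parse_last_modified(response.getheader("Last-Modified"))
    finally:
        conn.close()
```

Calling fetch_last_modified("darkstar") would repeat the telnet session above against the same intranet host. A HEAD request is used because only the headers matter here, so the page body never has to be downloaded.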