
Plan B - PHP

Ever wondered if you could be emailed automatically whenever your favorite Web pages changed? Our intrepid developer didn't just wonder - he sat down and wrote some code to make it happen. Here's his story.

TABLE OF CONTENTS:
  1. Watching The Web
  2. Code Poet
  3. Digging Deep
  4. Backtracking
  5. Plan B
  6. Closing Time
By: The Disenchanted Developer, (c) Melonfire
October 23, 2002

My next step, therefore, was to redesign my database table to support my new design. Here's the updated schema:






CREATE TABLE urls (
    id tinyint(3) unsigned NOT NULL auto_increment,
    url text NOT NULL,
    dsc varchar(255) NOT NULL default '',
    md5 varchar(255) NOT NULL default '',
    email varchar(255) NOT NULL default '',
    PRIMARY KEY (id)
);
Notice that I've replaced the original "date" column with one that holds the MD5 checksum for the page being monitored. Why? I figured that I could save myself a little disk space (and the time spent on designing a comparison algorithm) by using the MD5 checksum features built into PHP.

With the database schema updated, the next step is to update the PHP script that does all the work:
<?php
// set up database access parameters
$db_host = "localhost";
$db_user = "joe";
$db_pass = "65h49";
$db_name = "db167";

// open connection to database
$connection = mysql_connect($db_host, $db_user, $db_pass) or die("Unable to connect!");
mysql_select_db($db_name);

// generate and execute query
$query1 = "SELECT id, url, dsc, md5, email FROM urls";
$result1 = mysql_query($query1, $connection) or die("Error in query: $query1 . " . mysql_error());

// if rows exist
if (mysql_num_rows($result1) > 0)
{
    // iterate through resultset
    while (list($id, $url, $desc, $csum1, $email) = mysql_fetch_row($result1))
    {
        // read page contents into a string
        $contents = join('', file($url));
        // calculate MD5 value
        $csum2 = md5($contents);
        // compare with earlier value
        if ($csum1 != $csum2)
        {
            // send mail to owner
            mail(
                $email,
                "$desc has changed!",
                "This is an automated message to inform you that the URL \r\n\r\n $url \r\n\r\n" .
                "has changed since it was last checked. Please visit the URL to view the changes.",
                "From: The Web Watcher <nobody@some.domain>"
            ) or die("Could not send mail!");
            // update database with new checksum if changed
            $query2 = "UPDATE urls SET md5 = '$csum2' WHERE id = '" . $id . "'";
            $result2 = mysql_query($query2, $connection) or die("Error in query: $query2 . " . mysql_error());
        }
    }
}

// close database connection
mysql_close($connection);
?>
What's the difference between this script and the one I wrote earlier? This one retrieves the complete contents of the URL provided, using PHP's file() function, concatenates the lines into a single string, and computes an MD5 checksum to represent that string. This checksum is then compared to the checksum stored in the database from the last run; if they match, the page at that URL has not changed.
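As a minimal sketch of the read-and-concatenate step, the snippet below wraps it in a helper. The name read_as_string() is hypothetical and not part of the original script, and a local temporary file stands in for a URL so the example runs without network access.

```php
<?php
// Hypothetical helper, named here for illustration only.
// file() returns the resource as an array of lines (line endings kept);
// implode() glues them back into one string. join() is an alias of implode().
function read_as_string($source)
{
    $lines = file($source);
    if ($lines === false) {
        return false; // the resource could not be read
    }
    return implode('', $lines);
}

// Demonstration on a local temporary file, so no network access is needed.
$tmp = tempnam(sys_get_temp_dir(), 'ww');
file_put_contents($tmp, "<html>\n<body>Hello</body>\n</html>\n");

$contents = read_as_string($tmp);
echo md5($contents), "\n";                       // 32-character checksum of the file
var_dump($contents === file_get_contents($tmp)); // bool(true)

unlink($tmp);
```

On PHP 4.3.0 and later, file_get_contents() returns the same string in a single call, so it could replace the join()/file() pair.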

In case you're wondering what MD5 is (and nope, it's not James Bond's employer), the following extract from http://www.faqs.org/rfcs/rfc1321.html might be enlightening: "The [MD5] algorithm takes as input a message of arbitrary length and produces as output a 128-bit "fingerprint" or "message digest" of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest. The MD5 algorithm is intended for digital signature applications."
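To make the "fingerprint" idea concrete, here is a small sketch using PHP's md5() function. The two HTML strings are made-up sample inputs; they differ by a single character.

```php
<?php
// md5() returns the 128-bit digest as a string of 32 hexadecimal characters.
$original = "<html><body>Hello, world</body></html>";
$edited   = "<html><body>Hello, world!</body></html>"; // one character added

$digest1 = md5($original);
$digest2 = md5($edited);

echo strlen($digest1), "\n";                                 // 32
echo ($digest1 === $digest2) ? "same" : "different", "\n";   // different
echo md5("hello"), "\n";                                     // 5d41402abc4b2a76b9719d911017c592
```

Even a one-character edit produces a different digest, which is what lets a short stored string stand in for the whole page.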

Any change to the Web page located at that URL will result in a new checksum being generated the next time the script runs; this new checksum can be compared with the previous one and the results emailed to the concerned person, with the database simultaneously updated to reflect the new checksum.

Here's the relevant section of code:
// read page contents into a string
$contents = join('', file($url));
// calculate MD5 value
$csum2 = md5($contents);
// compare with earlier value
if ($csum1 != $csum2)
{
    // send mail to owner
    mail(
        $email,
        "$desc has changed!",
        "This is an automated message to inform you that the URL \r\n\r\n $url \r\n\r\n" .
        "has changed since it was last checked. Please visit the URL to view the changes.",
        "From: The Web Watcher <nobody@some.domain>"
    ) or die("Could not send mail!");
    // update database with new checksum if changed
    $query2 = "UPDATE urls SET md5 = '$csum2' WHERE id = '" . $id . "'";
    $result2 = mysql_query($query2, $connection) or die("Error in query: $query2 . " . mysql_error());
}
Since this system does not depend on the presence or absence of specific headers in the HTTP response, it is far more reliable than the previous technique, though not as efficient, because every page must be downloaded in full on every run.
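The comparison step can also be isolated into a small function, which makes it possible to try out without a database or a network connection. This is only a sketch: changed_checksum() is a hypothetical name, not something from the script above.

```php
<?php
// Hypothetical helper, not part of the original script.
// Returns the new checksum when the contents no longer match the stored one,
// or null when nothing has changed.
function changed_checksum($storedChecksum, $contents)
{
    $current = md5($contents);
    return ($current !== $storedChecksum) ? $current : null;
}

// Pretend this value was saved in the "md5" column on the previous run.
$stored = md5("version 1 of the page");

var_dump(changed_checksum($stored, "version 1 of the page"));            // NULL
var_dump(changed_checksum($stored, "version 2 of the page") !== null);   // bool(true)

// A freshly inserted row still holds the column default '' and so never matches.
var_dump(changed_checksum('', "version 1 of the page") !== null);        // bool(true)
```

The last call shows a side effect of the schema: because the md5 column defaults to an empty string, a newly added URL is reported as changed the first time the script runs.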

All that's left is to put this script into the server "crontab", so that it runs once a day:
5 0 * * *       /usr/local/bin/php -q webwatcher.php > /dev/null 2>&1
Here's a sample message generated by the system:
Date: Fri, 18 Oct 2002 15:12:52 +0530
Subject: Melonfire.com has changed!
To: user@some.domain
From: The Web Watcher <nobody@some.domain>
This is an automated message to inform you that the URL
http://www.melonfire.com/
has changed since it was last checked. Please visit the URL to view the
changes.
I have to warn you, though, that this system can substantially degrade performance on your server if you feed it a large number of URLs to monitor. Even though MD5 is a pretty efficient algorithm, the time taken to connect to each URL, retrieve the contents, create a checksum and process the results can eat up quite a few processor cycles (one of the reasons why I'm running it at midnight via "cron"). If you're planning to use this in your own organization, limit the number of URLs to around thirty and try to give the script relatively small Web pages to track...or else you might find yourself rapidly scratched off your network administrator's Christmas card list.
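One way to keep a slow or oversized page from stalling the whole run is to cap both the wait time and the number of bytes read. The sketch below assumes a PHP version with file_get_contents() and stream contexts; the function name fetch_limited() and the default limits are hypothetical values chosen for illustration, not figures from this article.

```php
<?php
// Hypothetical helper: read at most $maxBytes from $url and give up after
// $timeout seconds. Both defaults are illustrative, not recommendations.
function fetch_limited($url, $timeout = 10, $maxBytes = 262144)
{
    // The "timeout" option is read by the http:// wrapper;
    // wrappers that do not know the option ignore it.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout),
    ));

    // @ hides the warning on failure; the false return value is checked instead.
    $contents = @file_get_contents($url, false, $context, 0, $maxBytes);

    return ($contents === false) ? null : $contents;
}

// Demonstration on a local file, so the example runs without network access.
$tmp = tempnam(sys_get_temp_dir(), 'ww');
file_put_contents($tmp, str_repeat('a', 100));
echo strlen(fetch_limited($tmp, 1, 10)), "\n"; // 10
unlink($tmp);
```

Note the trade-off: with a byte cap in place, the checksum covers only the part of the page that was read, so a change beyond the cap would go unnoticed.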
