Watching The Web
By: The Disenchanted Developer, (c) Melonfire
    2002-10-23

    Table of Contents:
  • Watching The Web
  • Code Poet
  • Digging Deep
  • Backtracking
  • Plan B
  • Closing Time



    Watching The Web - Plan B


    (Page 5 of 6 )

    My next step, therefore, was to rework my database table to support the new approach. Here's the updated schema:






    CREATE TABLE urls (
    id tinyint(3) unsigned NOT NULL auto_increment,
    url text NOT NULL,
    dsc varchar(255) NOT NULL default '',
    md5 varchar(255) NOT NULL default '',
    email varchar(255) NOT NULL default '',
    PRIMARY KEY (id)
    );
    Notice that I've replaced the original "date" column with one that holds the MD5 checksum for the page being monitored. Why? I figured that I could save myself a little disk space (and the time spent on designing a comparison algorithm) by using the MD5 checksum features built into PHP.
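To make the new "md5" column concrete, here is a minimal sketch of what PHP's md5() function returns. The two page strings are made-up stand-ins for fetched pages, not part of the original script.

```php
<?php
// Two hypothetical page bodies that differ by a single character
$page_v1 = "<html><body>Hello</body></html>";
$page_v2 = "<html><body>Hello!</body></html>";

$csum_v1 = md5($page_v1);
$csum_v2 = md5($page_v2);

// By default md5() returns a hexadecimal string, always 32 characters long
echo strlen($csum_v1), "\n";            // 32

// Passing true as the second argument returns the raw 16-byte (128-bit) digest
echo strlen(md5($page_v1, true)), "\n"; // 16

// The same input always yields the same checksum...
var_dump($csum_v1 === md5($page_v1));   // bool(true)

// ...while even a one-character change yields a different one
var_dump($csum_v1 !== $csum_v2);        // bool(true)
```

Since the hexadecimal form is always 32 characters, the varchar(255) column in the schema has plenty of room for it.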

    With the database schema updated, the next step is to update the PHP script that does all the work:
    <?php
    // set up database access parameters
    $db_host = "localhost";
    $db_user = "joe";
    $db_pass = "65h49";
    $db_name = "db167";

    // open connection to database
    $connection = mysql_connect($db_host, $db_user, $db_pass) or die("Unable to connect!");
    mysql_select_db($db_name);

    // generate and execute query
    $query1 = "SELECT id, url, dsc, md5, email FROM urls";
    $result1 = mysql_query($query1, $connection) or die("Error in query: $query1 . " . mysql_error());

    // if rows exist
    if (mysql_num_rows($result1) > 0)
    {
        // iterate through resultset
        while (list($id, $url, $desc, $csum1, $email) = mysql_fetch_row($result1))
        {
            // read page contents into a string
            $contents = join('', file($url));

            // calculate MD5 value
            $csum2 = md5($contents);

            // compare with earlier value
            if ($csum1 != $csum2)
            {
                // send mail to owner (the From header must stay on a single line)
                mail($email, "$desc has changed!", "This is an automated message to inform you that the URL \r\n\r\n $url \r\n\r\nhas changed since it was last checked. Please visit the URL to view the changes.", "From: The Web Watcher <nobody@some.domain>") or die("Could not send mail!");

                // update database with new checksum if changed
                $query2 = "UPDATE urls SET md5 = '$csum2' WHERE id = '" . $id . "'";
                $result2 = mysql_query($query2, $connection) or die("Error in query: $query2 . " . mysql_error());
            }
        }
    }

    // close database connection
    mysql_close($connection);
    ?>
    What's the difference between this script and the one I wrote earlier? This one retrieves the complete contents of the URL provided, using PHP's file() function, concatenates the resulting lines into a single string, and creates an MD5 checksum to represent that string. This checksum is then compared to the checksum stored in the database from the last run; if they match, the page at that URL has not changed at all.

    In case you're wondering what MD5 is (and no, it's not James Bond's employer), the following extract from http://www.faqs.org/rfcs/rfc1321.html might be enlightening: "The [MD5] algorithm takes as input a message of arbitrary length and produces as output a 128-bit 'fingerprint' or 'message digest' of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest. The MD5 algorithm is intended for digital signature applications."

    Any change to the Web page located at that URL will result in a new checksum being generated the next time the script runs; this new checksum can be compared with the previous one and the results emailed to the concerned person, with the database simultaneously updated to reflect the new checksum.

    Here's the relevant section of code:
    // read page contents into a string
    $contents = join('', file($url));

    // calculate MD5 value
    $csum2 = md5($contents);

    // compare with earlier value
    if ($csum1 != $csum2)
    {
        // send mail to owner (the From header must stay on a single line)
        mail($email, "$desc has changed!", "This is an automated message to inform you that the URL \r\n\r\n $url \r\n\r\nhas changed since it was last checked. Please visit the URL to view the changes.", "From: The Web Watcher <nobody@some.domain>") or die("Could not send mail!");

        // update database with new checksum if changed
        $query2 = "UPDATE urls SET md5 = '$csum2' WHERE id = '" . $id . "'";
        $result2 = mysql_query($query2, $connection) or die("Error in query: $query2 . " . mysql_error());
    }
    Since this system does not depend on the presence or absence of specific headers in the HTTP response, it is far more reliable - though not as efficient - than the previous technique.
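One case worth separating from a genuine change is a failed fetch: an unreachable page would otherwise hash to a different value and look like an update. The sketch below isolates the comparison step into a helper that reports three outcomes. The function name check_page() and its return format are my own assumptions, and it uses file_get_contents() rather than file() plus join(); it is an illustration, not a replacement for the script above.

```php
<?php
// Hypothetical helper, not part of the original script.
// Returns array(status, checksum), where status is one of
// 'error', 'unchanged' or 'changed'.
function check_page($url, $stored_csum)
{
    // @ suppresses the warning; failure is reported via the return value
    $contents = @file_get_contents($url);
    if ($contents === false) {
        return array('error', null);
    }

    $csum = md5($contents);
    if ($csum === $stored_csum) {
        return array('unchanged', $csum);
    }
    return array('changed', $csum);
}

// Usage, with a local temporary file standing in for a Web page
$page = tempnam(sys_get_temp_dir(), 'ww');

file_put_contents($page, 'version 1');
list($status, $csum) = check_page($page, '');
echo $status, "\n"; // changed (nothing was stored yet)

list($status) = check_page($page, $csum);
echo $status, "\n"; // unchanged

file_put_contents($page, 'version 2');
list($status) = check_page($page, $csum);
echo $status, "\n"; // changed

unlink($page);
list($status) = check_page($page, $csum);
echo $status, "\n"; // error
```

With a helper like this, the calling loop could send mail and update the database only on 'changed', and skip or log the 'error' case.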

    All that's left is to put this script into the server "crontab", so that it runs once a day:
    5 0 * * *       /usr/local/bin/php -q webwatcher.php > /dev/null 2>&1
    Here's a sample message generated by the system:
    Date: Fri, 18 Oct 2002 15:12:52 +0530
    Subject: Melonfire.com has changed!
    To: user@some.domain
    From: The Web Watcher <nobody@some.domain>
    This is an automated message to inform you that the URL
    http://www.melonfire.com/
    has changed since it was last checked. Please visit the URL to view the
    changes.
    I have to warn you, though, that this system can substantially degrade performance on your server if you feed it a large number of URLs to monitor. Even though MD5 is a pretty efficient algorithm, the time taken to connect to each URL, retrieve the contents, create a checksum and process the results can eat up quite a few processor cycles (one of the reasons why I'm running it just after midnight via "cron"). If you're planning to use this in your own organization, limit the number of URLs to around thirty and try to give the script relatively small Web pages to track...or else you might find yourself rapidly scratched off your network administrator's Christmas card list.
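One way to keep each fetch bounded is to put a timeout and a size cap on it. The following is a rough sketch under my own assumptions: the function name fetch_capped() and the default limits (10 seconds, 256 KB) are illustrative, not values from the original script.

```php
<?php
// Hypothetical helper: fetch at most $max_bytes from $url, giving up
// after $timeout seconds on HTTP requests. Returns false on failure.
function fetch_capped($url, $timeout = 10, $max_bytes = 262144)
{
    // The 'timeout' option applies to http:// URLs
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout),
    ));

    // The fifth argument limits how many bytes are read
    return @file_get_contents($url, false, $context, 0, $max_bytes);
}

// Usage, with a local temporary file standing in for a large page
$page = tempnam(sys_get_temp_dir(), 'ww');
file_put_contents($page, str_repeat('x', 1000));

$contents = fetch_capped($page, 5, 100);
echo strlen($contents), "\n"; // 100

unlink($page);
```

The trade-off is that a checksum computed over a truncated page will miss any change that falls beyond the cap, so the limit should comfortably exceed the size of the pages you actually track.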
