Watching The Web
By: The Disenchanted Developer, (c) Melonfire
    2002-10-23


    Table of Contents:
  • Watching The Web
  • Code Poet
  • Digging Deep
  • Backtracking
  • Plan B
  • Closing Time



    Watching The Web - Plan B
    ( Page 5 of 6 )

    My next step, therefore, was to redesign my database table to support my new design. Here's the updated schema:






    CREATE TABLE urls (
      id tinyint(3) unsigned NOT NULL auto_increment,
      url text NOT NULL,
      dsc varchar(255) NOT NULL default '',
      md5 varchar(255) NOT NULL default '',
      email varchar(255) NOT NULL default '',
      PRIMARY KEY (id)
    );
    Notice that I've replaced the original "date" column with one that holds the MD5 checksum for the page being monitored. Why? I figured that I could save myself a little disk space (and the time spent on designing a comparison algorithm) by using the MD5 checksum features built into PHP.

    With the database schema updated, the next step is to update the PHP script that does all the work:
    <?php
    // set up database access parameters
    $db_host = "localhost";
    $db_user = "joe";
    $db_pass = "65h49";
    $db_name = "db167";

    // open connection to database
    $connection = mysql_connect($db_host, $db_user, $db_pass) or die("Unable to connect!");
    mysql_select_db($db_name);

    // generate and execute query
    $query1 = "SELECT id, url, dsc, md5, email FROM urls";
    $result1 = mysql_query($query1, $connection) or die("Error in query: $query1 . " . mysql_error());

    // if rows exist
    if (mysql_num_rows($result1) > 0)
    {
        // iterate through resultset
        while (list($id, $url, $desc, $csum1, $email) = mysql_fetch_row($result1))
        {
            // read page contents into a string
            $contents = join('', file($url));

            // calculate MD5 value
            $csum2 = md5($contents);

            // compare with earlier value
            // (strict comparison, so digests that look numeric, such as "0e...", are compared as strings)
            if ($csum1 !== $csum2)
            {
                // send mail to owner
                mail($email, "$desc has changed!", "This is an automated message to inform you that the URL \r\n\r\n $url \r\n\r\nhas changed since it was last checked. Please visit the URL to view the changes.", "From: The Web Watcher <nobody@some.domain>") or die("Could not send mail!");

                // update database with new checksum if changed
                $query2 = "UPDATE urls SET md5 = '$csum2' WHERE id = '" . $id . "'";
                $result2 = mysql_query($query2, $connection) or die("Error in query: $query2 . " . mysql_error());
            }
        }
    }

    // close database connection
    mysql_close($connection);
    ?>
    What's the difference between this script and the one I wrote earlier? This one retrieves the complete contents of the URL provided using PHP's file() function, joins the resulting array of lines into a single string, and computes an MD5 checksum of that string. This checksum is then compared with the one stored in the database during the last run; if the two match, the page at that URL has not changed.
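
    The script does not check whether file() actually managed to read the URL. The following is a minimal sketch, not part of the original script, of how the fetch-and-checksum step could report a failed read separately. The helper name page_checksum() is hypothetical, and the usage lines substitute a local temporary file for a remote URL so that they can run anywhere.

```php
<?php
// Hypothetical helper (not in the original script): returns the MD5
// checksum of whatever is readable at $url, or null if it cannot be read.
function page_checksum($url)
{
    // file_get_contents() returns false on failure; @ suppresses the warning
    $contents = @file_get_contents($url);
    if ($contents === false) {
        return null;
    }
    return md5($contents);
}

// Usage: a local temporary file stands in for a remote page
$tmp = tempnam(sys_get_temp_dir(), 'ww');
file_put_contents($tmp, "<html>version 1</html>");
$before = page_checksum($tmp);

file_put_contents($tmp, "<html>version 2</html>");
$after = page_checksum($tmp);

unlink($tmp);
$missing = page_checksum($tmp);   // the file is gone, so the read fails

var_dump($before !== $after);     // bool(true): new content, new checksum
var_dump($missing);               // NULL: the read failed
```

    Returning null gives the calling loop the option of skipping a URL that is temporarily unreachable, rather than comparing a meaningless checksum against the stored one.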

    In case you're wondering what MD5 is (and nope, it's not James Bond's employer), the following extract from http://www.faqs.org/rfcs/rfc1321.html might be enlightening: "The [MD5] algorithm takes as input a message of arbitrary length and produces as output a 128-bit 'fingerprint' or 'message digest' of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest. The MD5 algorithm is intended for digital signature applications."
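
    To make the "fingerprint" idea concrete, here is a small sketch using PHP's built-in md5() function. The input strings are arbitrary examples; "abc" is one of the test inputs listed in RFC 1321.

```php
<?php
// md5() returns the 128-bit digest as 32 hexadecimal characters
$digest = md5("abc");
echo $digest, "\n";           // 900150983cd24fb0d6963f7d28e17f72
echo strlen($digest), "\n";   // 32

// changing a single character of the input gives a different digest
$same = (md5("abc") === md5("abd"));
var_dump($same);              // bool(false)
```

    Because the digest has the same length regardless of how large the input is, the stored value stays small no matter how big the monitored page gets.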

    Any change to the Web page located at that URL will result in a different checksum being generated the next time the script runs. When the new checksum differs from the stored one, the script emails the person concerned and updates the database with the new checksum.

    Here's the relevant section of code:
    // read page contents into a string
    $contents = join('', file($url));

    // calculate MD5 value
    $csum2 = md5($contents);

    // compare with earlier value
    // (strict comparison, so digests that look numeric, such as "0e...", are compared as strings)
    if ($csum1 !== $csum2)
    {
        // send mail to owner
        mail($email, "$desc has changed!", "This is an automated message to inform you that the URL \r\n\r\n $url \r\n\r\nhas changed since it was last checked. Please visit the URL to view the changes.", "From: The Web Watcher <nobody@some.domain>") or die("Could not send mail!");

        // update database with new checksum if changed
        $query2 = "UPDATE urls SET md5 = '$csum2' WHERE id = '" . $id . "'";
        $result2 = mysql_query($query2, $connection) or die("Error in query: $query2 . " . mysql_error());
    }
    Since this system does not depend on the presence or absence of specific headers in the HTTP response, it is far more reliable - though not as efficient - than the previous technique.

    All that's left is to put this script into the server "crontab", so that it runs once a day:
    5 0 * * *       /usr/local/bin/php -q webwatcher.php > /dev/null 2>&1
    Here's a sample message generated by the system:
    Date: Fri, 18 Oct 2002 15:12:52 +0530
    Subject: Melonfire.com has changed!
    To: user@some.domain
    From: The Web Watcher <nobody@some.domain>
    
    This is an automated message to inform you that the URL
    
     http://www.melonfire.com/
    
    has changed since it was last checked. Please visit the URL to view the
    changes.
    I have to warn you, though, that this system can substantially degrade performance on your server if you feed it a large number of URLs to monitor. Even though MD5 is a pretty efficient algorithm, the time taken to connect to each URL, retrieve the contents, create a checksum and process the results can eat up quite a few processor cycles (one of the reasons why I'm running it just after midnight via "cron"). If you're planning to use this in your own organization, limit the number of URLs to around thirty, and try to give the script relatively small Web pages to track...or else you might find yourself rapidly scratched off your network administrator's Christmas card list.
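
    One way to keep the cost per URL under control is to put a ceiling on how long each fetch may take and how much data it may read. What follows is only a sketch under assumptions: the function name fetch_limited() and both limits are hypothetical values picked for illustration, and the usage lines read a local temporary file rather than a real URL.

```php
<?php
// Hypothetical limits, chosen for illustration only
define('WW_TIMEOUT_SECS', 10);
define('WW_MAX_BYTES', 262144);   // 256 KB

// Hypothetical helper: reads at most WW_MAX_BYTES from $url,
// or returns false if the read fails.
function fetch_limited($url)
{
    // the "http" timeout option applies to http:// and https:// URLs
    $context = stream_context_create(array(
        'http' => array('timeout' => WW_TIMEOUT_SECS),
    ));

    // arguments: source, use_include_path, context, offset, maximum length
    return @file_get_contents($url, false, $context, 0, WW_MAX_BYTES);
}

// Usage: a 300000-byte local file stands in for a large page
$tmp = tempnam(sys_get_temp_dir(), 'ww');
file_put_contents($tmp, str_repeat('x', 300000));
$page = fetch_limited($tmp);
unlink($tmp);

echo strlen($page), "\n";   // 262144
```

    The trade-off is that a checksum computed from a truncated page will not notice changes beyond the cut-off point, so the byte limit should be generous enough for the pages being tracked.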

     
     