Web Mining with Perl
By: Tommie Jones
    2002-03-05


    Table of Contents:
  • Web Mining with Perl
  • Accessing The Net (LWP)
  • Cut Along The Table Lines (HTML::TableExtract)
  • Learning From Links (HTML::LinkExtor)
  • Checking For Sameness (String::CRC)
  • Bringing It All Together
  • Conclusion



    Web Mining with Perl - Bringing It All Together
    ( Page 6 of 7 )

    The following Perl script uses the modules discussed so far (except HTML::Parser). It asks for three URLs, all of which should point to different pages on a single web site.

    The first two pages are broken into tables, and the contents of their cells are compared. Cells whose contents match on both pages are assumed to be part of the site template; cells whose contents differ are treated as dynamic content. Each page's tables are stored in a four-dimensional array. The first two dimensions identify a table (depth, count), as discussed in the HTML::TableExtract section. The last two dimensions identify a particular cell within that table (row, column). For these two pages the script does not keep the cell text itself: it strips the whitespace from each cell and stores a CRC checksum of what remains, so the comparison is between checksums. Every cell whose checksum differs between the two pages has its coordinates stored in the @cells array. The theory is that cells holding the site's menus and other template content will be ignored, while cells whose content changes from one page to the next will be recorded.
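
    The comparison can be tried without fetching anything. The sketch below is a minimal illustration, not the script's own code: $page1 and $page2 are made-up stand-ins for what pageParse would return, and diff_cells and cell_at are hypothetical helper names. A cell counts as changed when it exists on the second page and is either missing from, or different on, the first.

```perl
use strict;
use warnings;

# Made-up stand-ins for pageParse output: one table at depth 0, count 0,
# with two rows of two cells each.
my $page1 = [ [ [ [ 'Home | News', 'Story about Perl'   ],
                  [ 'Copyright',   'Posted Monday'      ] ] ] ];
my $page2 = [ [ [ [ 'Home | News', 'Story about Python' ],
                  [ 'Copyright',   'Posted Tuesday'     ] ] ] ];

# Walk a nested array along @path without creating missing levels.
sub cell_at {
    my ($node, @path) = @_;
    for my $i (@path) {
        return undef unless ref $node eq 'ARRAY';
        $node = $node->[$i];
    }
    return $node;
}

# Return [depth, count, row, col] for every cell of $t2 that is
# missing from $t1 or holds different content there.
sub diff_cells {
    my ($t1, $t2) = @_;
    my @cells;
    for my $depth (0 .. $#{$t2}) {
        my $tables = $t2->[$depth] or next;
        for my $count (0 .. $#{$tables}) {
            my $rows = $tables->[$count] or next;
            for my $row (0 .. $#{$rows}) {
                my $cols = $rows->[$row] or next;
                for my $col (0 .. $#{$cols}) {
                    my $new = $cols->[$col];
                    next unless defined $new;
                    my $old = cell_at($t1, $depth, $count, $row, $col);
                    push @cells, [$depth, $count, $row, $col]
                        if !defined $old || $old ne $new;
                }
            }
        }
    }
    return @cells;
}

my @changed = diff_cells($page1, $page2);
print join(' ', @$_), "\n" for @changed;
```

    With the sample data, the menu and copyright cells match and are skipped, while the second cell of each row is reported as dynamic content.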

    The third URL is used to test the content-extraction theory. HTML::LinkExtor first parses the contents of each changed cell to find the HTML links it contains. The content is then stripped of HTML tags and printed to the screen, followed by the links found in it.
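
    As a small, isolated illustration of that last step, the sketch below runs HTML::LinkExtor over a single hard-coded fragment. The fragment and the helper name extract_hrefs are assumptions made for the example; only the new, parse, eof and links calls come from the module itself.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Made-up cell content, as it might look when the HTML is kept.
my $content = '<p>Read the <a href="/news/1.html">full story</a> or '
            . '<a href="/news/archive.html">browse the archive</a>.</p>';

# Return the href of every <a> tag found in an HTML fragment.
sub extract_hrefs {
    my ($html) = @_;
    my $parser = HTML::LinkExtor->new;
    $parser->parse($html);
    $parser->eof;
    my @hrefs;
    for my $link ($parser->links) {
        # Each link is [tag, attribute => value, ...]
        my ($tag, %attr) = @$link;
        push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
    }
    return @hrefs;
}

my @hrefs = extract_hrefs($content);

# Strip the tags from a copy of the content, leaving plain text.
(my $text = $content) =~ s/<.*?>//gs;

print "$text\n";
print "$_\n" for @hrefs;
```

    The links are collected before the tags are stripped, since the stripped text no longer contains them.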

    #!/usr/bin/perl
    use lib qw(. ..);
    use HTML::TableExtract;
    use LWP::Simple;
    use String::CRC;
    use Data::Dumper;
    use HTML::LinkExtor;

    # Data Entry Portion
    print "Enter first URL: ";
    my $url = <>;
    chomp $url;
    my $t1 = pageParse($url, 1, 0);    # Parse out tables in first URL

    print "Enter next URL: ";
    my $url2 = <>;
    chomp $url2;
    my $t2 = pageParse($url2, 1, 0);   # Parse out tables in second URL

    my @cells;                         # Coordinates of the cells that differ
    my ($depth, $count, $row, $col);

    # Loop through elements of array and find the cells that do not
    # have equivalent content
    for ($depth = 0; $depth < max(scalar(@$t1), scalar(@$t2)); $depth++) {
        for ($count = 0;
             $count < max(scalar(@{$t1->[$depth]}), scalar(@{$t2->[$depth]}));
             $count++) {
            for ($row = 0;
                 $row < max(scalar(@{$t1->[$depth][$count]}),
                            scalar(@{$t2->[$depth][$count]}));
                 $row++) {
                for ($col = 0;
                     $col < max(scalar(@{$t1->[$depth][$count][$row]}),
                                scalar(@{$t2->[$depth][$count][$row]}));
                     $col++) {
                    if (defined $t2->[$depth][$count][$row][$col]) {
                        if ($t1->[$depth][$count][$row][$col] ne
                            $t2->[$depth][$count][$row][$col]) {
                            print " Cell $depth $count $row $col differs\n";
                            push @cells, [$depth, $count, $row, $col];
                        }
                    }
                } # for $col
            } # for $row
        } # for $count
    } # for $depth

    print "Enter URL you want to rip links from: ";
    $url = <>;
    chomp $url;
    my $tab = pageParse($url, 0, 1);   # Keep the cell text and its HTML

    foreach my $coords (@cells) {
        my ($depth, $count, $row, $col) = @$coords;
        my $content = $tab->[$depth][$count][$row][$col];
        next unless defined $content;  # Cell does not exist on this page

        my $linkParser = HTML::LinkExtor->new();
        $linkParser->parse($content);
        $linkParser->eof;
        my @links = $linkParser->links;    # Get links
        $content =~ s/<.*?>//g;            # Strip HTML tags

        print "-----Depth $depth ; Count $count ; Row $row ; Col $col \n";
        print $content, "\n";
        print "-----Links:\n";
        foreach my $link (@links) {
            my $tag = shift @$link;
            if ($tag eq 'a') {
                my %linkHash = @$link;
                print $linkHash{href}, "\n";
            }
        }
        print "-----END CONTENT\n";
    }

    # Parses an HTML page and stores the resulting tables
    # in a four-dimensional array. If $func is true, each cell is
    # replaced by a checksum of its whitespace-stripped content.
    sub pageParse {
        my $url       = shift;
        my $func      = shift;
        my $keep_html = shift || 0;

        my $te = HTML::TableExtract->new(
            depth        => 0,
            gridmap      => 0,
            keep_html    => $keep_html,
            br_translate => 1,
        );
        chomp $url;
        my $content = get($url);
        die "Could not fetch $url\n" unless defined $content;
        $te->parse($content);

        my $tables = [];
        # Loop through all tables on page
        foreach my $ts ($te->table_states()) {
            my $row_idx = 0;
            # Loop through rows for a table
            foreach my $row ($ts->rows) {
                my $col_idx = 0;
                # Loop through columns in row
                foreach my $column (@$row) {
                    if ($func) {
                        $column =~ s/\s//g;
                        my $crc = crc($column, 32);    # Build checksum
                        $column = $crc;
                    }
                    else {
                        $column =~ s/\s+/ /g;
                    }
                    $tables->[$ts->depth()][$ts->count()][$row_idx][$col_idx] = $column;
                    $col_idx++;
                }
                $row_idx++;
            }
        }
        return $tables;
    }

    # Returns the larger of two numbers. The comparison must be numeric:
    # a string comparison (gt) would rank 9 above 10.
    sub max {
        my ($x1, $x2) = @_;
        return $x1 if ($x1 > $x2);
        return $x2;
    }

    Obviously, the script above is not very practical as it stands. With some modification, however, it could be very useful.

    With a few changes you can create an automated personal newsletter. Instead of asking for three URLs, the script could watch one particular site. After the first run, the nested array returned by the pageParse subroutine could be saved to disk. On the next run, the new pageParse result could be compared to the saved one. If the content of any table cell has changed, that cell's content could be emailed, creating an automated newsletter.
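
    One way to keep the result between runs is Perl's core Storable module. The following is only a sketch under assumptions: the file name site_snapshot.sto and the helper swap_snapshot are invented for illustration, $tables is a made-up stand-in for a real pageParse result, and the comparison and emailing steps are left out.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Hypothetical snapshot file; any writable path would do.
my $snapshot_file = 'site_snapshot.sto';

# Load the snapshot left by the previous run (undef on the first run),
# then overwrite it with the current tables for next time.
sub swap_snapshot {
    my ($file, $current) = @_;
    my $previous = -e $file ? retrieve($file) : undef;
    store($current, $file);
    return $previous;
}

# Made-up stand-in for what pageParse would return for the watched site.
my $tables = [ [ [ [ 'menu', 'headline for today' ] ] ] ];

my $previous = swap_snapshot($snapshot_file, $tables);
if (defined $previous) {
    print "Previous snapshot loaded; compare it with the new tables.\n";
}
else {
    print "First run: snapshot saved, nothing to compare yet.\n";
}
```

    Run from cron, a script built this way would only need to compare $previous with $tables cell by cell and mail whatever changed.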
