Perl
  Home arrow Perl arrow Page 6 - Web Mining with Perl
Dev Shed Forums 
Administration  
Apache  
BrainDump  
DHTML  
Flash  
Java  
JavaScript  
Multimedia  
MySQL  
Oracle  
Perl  
PHP  
Practices  
Python  
Reviews  
Security  
Style-Sheets  
Web Services  
XML  
Zend  
Zope  
Forums Sitemap 
IBM® developerWorks 
Dedicated Servers 
E-Commerce Hosting 
Linux Web Hosting 
Managed Hosting 
Small Business Hosting 
Download TestComplete 
VPS Hosting 
Weekly Newsletter

 
Developer Updates  
Free Website Content 
IBM Rational Software Development Conference
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
PERL

Web Mining with Perl
By: Tommie Jones
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 47
    2002-03-05

    Table of Contents:
  • Web Mining with Perl
  • Accessing The Net (LWP)
  • Cut Along The Table Lines (HTML::TableExtract)
  • Learning From Links (HTML::LinkExtor)
  • Checking For Sameness (String::CRC)
  • Bringing It All Together
  • Conclusion

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
     
    ADVERTISEMENT

    PCmover - $15 Off with Coupon Code CJPH7Q

    Web Mining with Perl - Bringing It All Together
    (Page 6 of 7 )

    The following code is a Perl script that uses the discussed modules (except HTML::Parser). For this script you give it three URLs. All three URLs should refer to different pages on a single web site.

    The first two URLs are divided into tables and their cells contents are compared to each other. When the contents of the cells of each table match they are assumed to make up the template. If the cells content is different between the two tables then these cells are identified as the dynamic content. Each table is stored in a four dimensional array. The first two dimensions identify a table (depth, count) as discussed in the HTML::TableExtract section. The last two dimensions refer too a particular cell in each table (row, column). The cells are compared to each other and any cell where the content differs between the tow pages is identified and stored in the @cells array. The theory is that cells that contain the site's menus and other template type content will be ignored. The cells that contain content that changes from one page to the next will be recorded.

    The third URL is used to test the content extraction theory. The HTML::LinkExtor first parses the contents of each changed cell. This finds the html links stored in the content. The content is then stripped of html and printed to the screen. Last of all the links found in the content is printed to the screen.

    #!/usr/bin/Perl use lib qw(. ..); use HTML::TableExtract; use LWP::Simple; use String::CRC; use Data::Dumper; use HTML::LinkExtor; # Data Entry Portion print "Enter first URL: "; my $url = <>; chomp $url; my $t1 = pageParse($url, 1, 0); #Parse out tables in first URL print "Enter next URL:"; my $url2 = <>; my $t2 = pageParse($url2, 1, 0); # Parse out tables in second URL my ($depth, $count, $row, $col); # Loop through elements of array and find the cells that do not # have equivalent content for ($depth=0;$depth< max(scalar(@$t1), scalar(@$t2)); $depth++) { for ($count=0;$count< max(scalar(@{$t1->[$depth]}), scalar(@{$t2->[$depth]})); $count++) { for ($row=0; $row < max(scalar(@{$t1->[$depth][$count]}), scalar(@{$t2->[$depth][$count]})); $row++) { for ($col=0; $col< max(scalar(@{$t1->[$depth][$count][$row]}), scalar(@{$t2->[$depth][$count][$row]})); $col++) { if (defined $t2->[$depth][$count][$row][$col]) { if ($t1->[$depth][$count][$row][$col] ne $t2->[$depth][$count][$row][$col]) { print " Cell $depth $count $row $col differs\n"; push @cells, [$depth, $count, $row, $col]; } #if ($t1->[$depth][$count][$row][$col] ne $t2->[$depth][$count][$row][$col]) } # if (defined $t2->[$depth][$count][$row][$col]) } #for $col } #for $row } #for $count } #for $depth print "Enter URL You want to rip links from:"; $url = <>; chomp $url; my $tab = pageParse($url, 0, 1); foreach my $coords (@cells) { my ($depth, $count, $row, $col) = @$coords; my $linkParser = HTML::LinkExtor->new(); my $content = $tab->[$depth][$count][$row][$col]; $linkParser->parse($content); $content =~ s/<.*?>//g; my @links = $linkParser->links; # get Links print "-----Depth $depth ; Count $count ; Row $row ; Col $col \n"; print $content; print "-----Links:\n"; foreach my $link (@links) { my $tag = shift @$link; if ($tag eq 'a') { my %linkHash = @$link; print $linkHash{href}, "\n" } } print "-----END CONTENT\n"; } #Parses HTML page and store resulting tables # into a four dimensional array. sub pageParse { my $url=shift; my $func = shift ; my $keep_html = shift || 0; my $te = new HTML::TableExtract( depth=>0, gridmap=>0, keep_html=> $keep_html, br_translate=>1); chomp $url; my $content = get($url); $te->parse($content); my $tables=[]; # Loop through All tables on page foreach my $ts ($te->table_states()) { my $row_idx =0; # Loop through rows for a table foreach my $row ($ts->rows) { my $col_idx =0; foreach my $column ( @$row) # Loop through columns in row. { if ( $func) { $column =~ s/\s//g; my $crc= crc($column, 32); # Build checksum $column = $crc; } else { $column =~ s/\s+/ /g; } $tables->[$ts->depth()][$ts->count()][$row_idx][$col_idx] = $column; $col_idx++; } $row_idx++; } } return $tables; } sub max # returns max of two values { my ($x1, $x2) = @_; return $x1 if ($x1 gt $x2); return $x2; }
    Obviously, the previous script is not very practical However it could be modified and be very useful.

    With a few changes you can create an automated personal newsletter. Instead of asking for three URLs the script could be modified to watch one particular site. After the first execution the generated hash from the pageParse subroutine could be stored off. The next time the script is run the new pageParse result could be compared to the original. If content is different in any of table cells the content in that cell could be emailed thus creating an automated newsletter.

    More Perl Articles
    More By Tommie Jones


     

       

    PERL ARTICLES

    - Perl: A Continuing Look at Hashes and Multid...
    - Perl: Another Round with Hashes
    - Perl Hashes
    - Perl Lists: A Final Look at List::Util
    - Perl Lists: Utilizing List::Util
    - Perl Lists: The Split() Function
    - SQL and CGI with Perl and DBI
    - Perl Lists: More Functions and Operators
    - SELECT Queries and Perl
    - Perl Lists: More on Manipulation
    - Creating a Database with Perl and DBI
    - Perl: Sailing the List(less) Seas
    - Perl and DBI
    - Perl: Concatenating Text and More
    - Perl Text: Quoting Without Quote Marks

     
    Accelerating Trading Partner Performance
     
    Competing on Analytics
     
    Cost Effective Scaling with Virtualization and Coyote Point Systems
     
    Five Checkpoints to Implementing IP Telephony
     
    Hosted Email Security: Staying Ahead of New Threats
     




    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 2 hosted by Hostway