
Bringing It All Together - Perl

It is common knowledge that the Internet is a great data source. It is also common knowledge that it is difficult to get the information you want in the format you need. No longer.

  1. Web Mining with Perl
  2. Accessing The Net (LWP)
  3. Cut Along The Table Lines (HTML::TableExtract)
  4. Learning From Links (HTML::LinkExtor)
  5. Checking For Sameness (String::CRC)
  6. Bringing It All Together
  7. Conclusion
By: Tommie Jones
March 05, 2002



The following code is a Perl script that uses the modules discussed so far (except HTML::Parser). You give the script three URLs, all of which should refer to different pages on a single web site.

The first two URLs are divided into tables, and the contents of their cells are compared to each other. When the contents of corresponding cells match, they are assumed to make up the template. If a cell's content differs between the two tables, that cell is identified as dynamic content. Each table is stored in a four-dimensional array. The first two dimensions identify a table (depth, count), as discussed in the HTML::TableExtract section. The last two dimensions refer to a particular cell in each table (row, column). The cells are compared to each other, and any cell whose content differs between the two pages is identified and stored in the @cells array. The theory is that cells containing the site's menus and other template-type content will be ignored, while cells containing content that changes from one page to the next will be recorded.
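The comparison idea can be seen in miniature. The following sketch uses a hypothetical (row, col) grid of cell text for each page; the full script below adds the (depth, count) dimensions and compares String::CRC checksums of the cells rather than the raw text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical cell grids from two pages of the same site
my $page1 = [ [ 'Menu', 'Story about Perl' ],
              [ 'Menu', 'Copyright 2002'   ] ];
my $page2 = [ [ 'Menu', 'Story about CPAN' ],
              [ 'Menu', 'Copyright 2002'   ] ];

my @cells;
for my $row (0 .. $#$page1) {
    for my $col (0 .. $#{ $page1->[$row] }) {
        next if $page1->[$row][$col] eq $page2->[$row][$col];
        push @cells, [$row, $col];    # content changed between pages
    }
}

# The template cells ('Menu', the copyright line) match on both
# pages, so only the story cell is reported as dynamic content
print "Cell $_->[0] $_->[1] differs\n" for @cells;
```

Only cell (0, 1) is reported: the menu and copyright cells are identical on both pages and are treated as template.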

The third URL is used to test the content extraction theory. An HTML::LinkExtor object first parses the contents of each changed cell, which finds the HTML links stored in the content. The content is then stripped of HTML and printed to the screen. Last of all, the links found in the content are printed to the screen.
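In isolation, that per-cell step looks something like this sketch (the HTML snippet is hypothetical; HTML::LinkExtor ships with the HTML-Parser distribution, which the script already requires):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# A hypothetical cell's content, kept with its HTML intact
my $content = '<p>Read the <a href="/story.html">full story</a> here.</p>';

my $linkParser = HTML::LinkExtor->new();
$linkParser->parse($content);          # collect the links first

(my $text = $content) =~ s/<.*?>//g;   # then crudely strip the tags
print "$text\n";

# links() returns one array ref per link: [tag, attr => value, ...]
my @hrefs;
for my $link ($linkParser->links) {
    my ($tag, %attr) = @$link;
    push @hrefs, $attr{href} if $tag eq 'a';
}
print "$_\n" for @hrefs;
```

Note the order: the parser must see the content before the tags are stripped, since stripping destroys the very anchors the parser is looking for.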

#!/usr/bin/perl
use lib qw(. ..);
use HTML::TableExtract;
use LWP::Simple;
use String::CRC;
use Data::Dumper;
use HTML::LinkExtor;

# Data entry portion
print "Enter first URL: ";
my $url = <>;
chomp $url;
my $t1 = pageParse($url, 1, 0);    # Parse out tables in first URL

print "Enter next URL: ";
my $url2 = <>;
my $t2 = pageParse($url2, 1, 0);   # Parse out tables in second URL

my ($depth, $count, $row, $col);
my @cells;

# Loop through elements of the arrays and find the cells that do not
# have equivalent content
for ($depth = 0; $depth < max(scalar(@$t1), scalar(@$t2)); $depth++) {
  for ($count = 0;
       $count < max(scalar(@{$t1->[$depth]}), scalar(@{$t2->[$depth]}));
       $count++) {
    for ($row = 0;
         $row < max(scalar(@{$t1->[$depth][$count]}),
                    scalar(@{$t2->[$depth][$count]}));
         $row++) {
      for ($col = 0;
           $col < max(scalar(@{$t1->[$depth][$count][$row]}),
                      scalar(@{$t2->[$depth][$count][$row]}));
           $col++) {
        if (defined $t2->[$depth][$count][$row][$col]) {
          if ($t1->[$depth][$count][$row][$col] ne
              $t2->[$depth][$count][$row][$col]) {
            print " Cell $depth $count $row $col differs\n";
            push @cells, [$depth, $count, $row, $col];
          }
        }
      } # for $col
    } # for $row
  } # for $count
} # for $depth

print "Enter URL you want to rip links from: ";
$url = <>;
chomp $url;
my $tab = pageParse($url, 0, 1);

foreach my $coords (@cells) {
  my ($depth, $count, $row, $col) = @$coords;
  my $linkParser = HTML::LinkExtor->new();
  my $content = $tab->[$depth][$count][$row][$col];
  $linkParser->parse($content);
  $content =~ s/<.*?>//g;
  my @links = $linkParser->links;    # get links
  print "-----Depth $depth ; Count $count ; Row $row ; Col $col\n";
  print $content;
  print "-----Links:\n";
  foreach my $link (@links) {
    my $tag = shift @$link;
    if ($tag eq 'a') {
      my %linkHash = @$link;
      print $linkHash{href}, "\n";
    }
  }
  print "-----END CONTENT\n";
}

# Parses an HTML page and stores the resulting tables
# in a four-dimensional array.
sub pageParse {
  my $url       = shift;
  my $func      = shift;
  my $keep_html = shift || 0;
  my $te = HTML::TableExtract->new(depth        => 0,
                                   gridmap      => 0,
                                   keep_html    => $keep_html,
                                   br_translate => 1);
  chomp $url;
  my $content = get($url);
  $te->parse($content);
  my $tables = [];
  # Loop through all tables on the page
  foreach my $ts ($te->table_states()) {
    my $row_idx = 0;
    # Loop through the rows of a table
    foreach my $row ($ts->rows) {
      my $col_idx = 0;
      # Loop through the columns in the row
      foreach my $column (@$row) {
        if ($func) {
          $column =~ s/\s//g;
          my $crc = crc($column, 32);    # Build checksum
          $column = $crc;
        } else {
          $column =~ s/\s+/ /g;
        }
        $tables->[$ts->depth()][$ts->count()][$row_idx][$col_idx] = $column;
        $col_idx++;
      }
      $row_idx++;
    }
  }
  return $tables;
}

sub max {    # returns the larger of two values
  my ($x1, $x2) = @_;
  return $x1 if ($x1 > $x2);    # numeric compare; 'gt' would sort "10" before "9"
  return $x2;
}
Obviously, the previous script is not very practical by itself. However, with some modification it could be very useful.

With a few changes you can create an automated personal newsletter. Instead of asking for three URLs, the script could be modified to watch one particular site. After the first execution, the checksum structure returned by the pageParse subroutine could be stored off. The next time the script is run, the new pageParse result could be compared to the original. If the content of any of the table cells has changed, the content in that cell could be emailed, thus creating an automated newsletter.
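The article leaves the persistence step as an exercise; one way to sketch it is with the core Storable module. The grids below stand in for pageParse() checksum results, and the mail step is reduced to a print:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Storable qw(store retrieve);

my $dir   = tempdir(CLEANUP => 1);
my $cache = "$dir/site-tables.stor";

# First run: store the per-cell checksums (stand-ins for crc() output)
my $baseline = [ [ 1111, 2222 ],
                 [ 1111, 3333 ] ];
store($baseline, $cache);

# Next run: the lower-right cell's content has changed
my $current  = [ [ 1111, 2222 ],
                 [ 1111, 4444 ] ];
my $previous = retrieve($cache);

my @changed;
for my $row (0 .. $#$current) {
    for my $col (0 .. $#{ $current->[$row] }) {
        next if $previous->[$row][$col] == $current->[$row][$col];
        push @changed, [$row, $col];
    }
}

# Here the real script would re-fetch the cell's text and mail it
print "Cell $_->[0] $_->[1] changed -- mail its content\n" for @changed;
store($current, $cache);    # becomes the baseline for the next run
```

Because only checksums are cached, the stored file stays small no matter how large the pages are; the changed cell's actual text would be re-fetched (with keep_html off) when the mail is built.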
