Web Mining with Perl - Bringing It All Together (
Page 6 of 7 )
The following code is a Perl script that uses
the discussed modules (except HTML::Parser). For this script you give it three
URLs. All three URLs should refer to different pages on a single web site.
The first two URLs are divided into tables and their cells contents are
compared to each other. When the contents of the cells of each table match they
are assumed to make up the template. If the cells content is different between
the two tables then these cells are identified as the dynamic content. Each
table is stored in a four dimensional array. The first two dimensions identify a
table (depth, count) as discussed in the HTML::TableExtract section. The last
two dimensions refer too a particular cell in each table (row, column). The
cells are compared to each other and any cell where the content differs between
the tow pages is identified and stored in the @cells array. The theory is that
cells that contain the site's menus and other template type content will be
ignored. The cells that contain content that changes from one page to the next
will be recorded.
The third URL is used to test the content extraction
theory. The HTML::LinkExtor first parses the contents of each changed cell. This
finds the html links stored in the content. The content is then stripped of html
and printed to the screen. Last of all the links found in the content is printed
to the screen.
#!/usr/bin/Perl
use lib qw(. ..);
use HTML::TableExtract;
use LWP::Simple;
use String::CRC;
use Data::Dumper;
use HTML::LinkExtor;
# Data Entry Portion
print "Enter first URL: ";
my $url = <>;
chomp $url;
my $t1 = pageParse($url, 1, 0); #Parse out tables in first URL
print "Enter next URL:";
my $url2 = <>;
my $t2 = pageParse($url2, 1, 0); # Parse out tables in second URL
my ($depth, $count, $row, $col);
# Loop through elements of array and find the cells that do not
# have equivalent content
for ($depth=0;$depth< max(scalar(@$t1), scalar(@$t2)); $depth++)
{
for ($count=0;$count< max(scalar(@{$t1->[$depth]}),
scalar(@{$t2->[$depth]})); $count++)
{
for ($row=0;
$row < max(scalar(@{$t1->[$depth][$count]}),
scalar(@{$t2->[$depth][$count]}));
$row++)
{
for ($col=0;
$col< max(scalar(@{$t1->[$depth][$count][$row]}),
scalar(@{$t2->[$depth][$count][$row]}));
$col++)
{
if (defined $t2->[$depth][$count][$row][$col])
{
if ($t1->[$depth][$count][$row][$col]
ne $t2->[$depth][$count][$row][$col])
{
print " Cell $depth $count $row $col differs\n";
push @cells, [$depth, $count, $row, $col];
} #if ($t1->[$depth][$count][$row][$col]
ne $t2->[$depth][$count][$row][$col])
} # if (defined $t2->[$depth][$count][$row][$col])
} #for $col
} #for $row
} #for $count
} #for $depth
print "Enter URL You want to rip links from:";
$url = <>;
chomp $url;
my $tab = pageParse($url, 0, 1);
foreach my $coords (@cells)
{
my ($depth, $count, $row, $col) = @$coords;
my $linkParser = HTML::LinkExtor->new();
my $content = $tab->[$depth][$count][$row][$col];
$linkParser->parse($content);
$content =~ s/<.*?>//g;
my @links = $linkParser->links; # get Links
print "-----Depth $depth ; Count $count ; Row $row ; Col $col \n";
print $content;
print "-----Links:\n";
foreach my $link (@links)
{
my $tag = shift @$link;
if ($tag eq 'a')
{
my %linkHash = @$link;
print $linkHash{href}, "\n"
}
}
print "-----END CONTENT\n";
}
#Parses HTML page and store resulting tables
# into a four dimensional array.
sub pageParse
{
my $url=shift;
my $func = shift ;
my $keep_html = shift || 0;
my $te = new HTML::TableExtract( depth=>0, gridmap=>0,
keep_html=> $keep_html, br_translate=>1);
chomp $url;
my $content = get($url);
$te->parse($content);
my $tables=[];
# Loop through All tables on page
foreach my $ts ($te->table_states())
{
my $row_idx =0;
# Loop through rows for a table
foreach my $row ($ts->rows)
{
my $col_idx =0;
foreach my $column ( @$row) # Loop through columns in row.
{
if ( $func)
{
$column =~ s/\s//g;
my $crc= crc($column, 32); # Build checksum
$column = $crc;
}
else
{
$column =~ s/\s+/ /g;
}
$tables->[$ts->depth()][$ts->count()][$row_idx][$col_idx]
= $column;
$col_idx++;
}
$row_idx++;
}
}
return $tables;
}
sub max # returns max of two values
{
my ($x1, $x2) = @_;
return $x1 if ($x1 gt $x2);
return $x2;
}
Obviously, the previous script is not very practical However
it could be modified and be very useful.
With a few changes you can
create an automated personal newsletter. Instead of asking for three URLs the
script could be modified to watch one particular site. After the first execution
the generated hash from the pageParse subroutine could be stored off. The next
time the script is run the new pageParse result could be compared to the
original. If content is different in any of table cells the content in that cell
could be emailed thus creating an automated newsletter.