It is common knowledge that the Internet is a great data source. It is also common knowledge that it is difficult to get the information you want in the format you need. No longer.
The following code is a Perl script that uses the modules discussed (except HTML::Parser). You give the script three URLs, and all three should refer to different pages on a single web site.
The first two URLs are divided into tables, and the contents of their cells are compared to each other. When the contents of corresponding cells match, those cells are assumed to make up the site's template. When the contents differ between the two pages, the cells are identified as dynamic content. Each page's tables are stored in a four-dimensional array. The first two dimensions identify a table (depth, count), as discussed in the HTML::TableExtract section. The last two dimensions refer to a particular cell within that table (row, column). The cells are compared to each other, and the coordinates of any cell whose content differs between the two pages are stored in the @cells array. The theory is that cells containing the site's menus and other template-type content will be ignored, while cells containing content that changes from one page to the next will be recorded.
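To make the indexing concrete, here is a minimal sketch (the structure name and the index values are hypothetical) of how a single cell is addressed:

# Hypothetical sketch of the four-dimensional indexing:
# ($depth, $count) select a table, ($row, $col) select a cell within it.
my $cell = $tables->[0][0][2][1]; # first table at depth 0, third row, second column
print "Cell content: $cell\n";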
The third URL is used to test the content-extraction theory. HTML::LinkExtor first parses the contents of each changed cell to find the HTML links it contains. The content is then stripped of HTML and printed to the screen. Finally, the links found in the content are printed to the screen.
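For readers who have not used HTML::LinkExtor before, the following minimal sketch (the HTML fragment is made up) shows how links are pulled out of a piece of markup; the full script below applies the same pattern to each changed cell:

use HTML::LinkExtor;
my $html = '<p>Read <a href="http://www.example.com/news.html">the news</a> here.</p>';
my $parser = HTML::LinkExtor->new();
$parser->parse($html);
$parser->eof();
foreach my $link ($parser->links)
{
    my ($tag, %attrs) = @$link;   # each link is [tag, attribute => value, ...]
    print $attrs{href}, "\n" if $tag eq 'a';
}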
#!/usr/bin/perl
use lib qw(. ..);
use HTML::TableExtract;
use LWP::Simple;
use String::CRC;
use Data::Dumper;
use HTML::LinkExtor;
# Data Entry Portion
print "Enter first URL: ";
my $url = <>;
chomp $url;
my $t1 = pageParse($url, 1, 0); #Parse out tables in first URL
print "Enter next URL:";
my $url2 = <>;
my $t2 = pageParse($url2, 1, 0); # Parse out tables in second URL
my ($depth, $count, $row, $col);
my @cells; # coordinates of cells whose content differs between the two pages
# Loop through the elements of both arrays and find the cells that do
# not have equivalent content
for ($depth=0;$depth< max(scalar(@$t1), scalar(@$t2)); $depth++)
{
for ($count=0;$count< max(scalar(@{$t1->[$depth]}),
scalar(@{$t2->[$depth]})); $count++)
{
for ($row=0;
$row < max(scalar(@{$t1->[$depth][$count]}),
scalar(@{$t2->[$depth][$count]}));
$row++)
{
for ($col=0;
$col< max(scalar(@{$t1->[$depth][$count][$row]}),
scalar(@{$t2->[$depth][$count][$row]}));
$col++)
{
if (defined $t2->[$depth][$count][$row][$col])
{
if ($t1->[$depth][$count][$row][$col]
ne $t2->[$depth][$count][$row][$col])
{
print " Cell $depth $count $row $col differs\n";
push @cells, [$depth, $count, $row, $col];
} # if cell contents differ between the two pages
} # if (defined $t2->[$depth][$count][$row][$col])
} #for $col
} #for $row
} #for $count
} #for $depth
print "Enter URL You want to rip links from:";
$url = <>;
chomp $url;
my $tab = pageParse($url, 0, 1);
foreach my $coords (@cells)
{
my ($depth, $count, $row, $col) = @$coords;
my $linkParser = HTML::LinkExtor->new();
my $content = $tab->[$depth][$count][$row][$col];
next unless defined $content; # the third page may not contain this cell
$linkParser->parse($content);
$linkParser->eof();
$content =~ s/<.*?>//g;
my @links = $linkParser->links; # get Links
print "-----Depth $depth ; Count $count ; Row $row ; Col $col \n";
print $content;
print "-----Links:\n";
foreach my $link (@links)
{
my $tag = shift @$link;
if ($tag eq 'a')
{
my %linkHash = @$link;
print $linkHash{href}, "\n";
}
}
print "-----END CONTENT\n";
}
# Parses an HTML page and stores the resulting tables
# in a four-dimensional array.
sub pageParse
{
my $url=shift;
my $func = shift ;
my $keep_html = shift || 0;
my $te = HTML::TableExtract->new(depth => 0, gridmap => 0,
keep_html => $keep_html, br_translate => 1);
chomp $url;
my $content = get($url);
die "Could not fetch $url\n" unless defined $content;
$te->parse($content);
my $tables=[];
# Loop through All tables on page
foreach my $ts ($te->table_states())
{
my $row_idx =0;
# Loop through rows for a table
foreach my $row ($ts->rows)
{
my $col_idx =0;
foreach my $column ( @$row) # Loop through columns in row.
{
if ( $func)
{
$column =~ s/\s//g;
my $crc= crc($column, 32); # Build checksum
$column = $crc;
}
else
{
$column =~ s/\s+/ /g;
}
$tables->[$ts->depth()][$ts->count()][$row_idx][$col_idx]
= $column;
$col_idx++;
}
$row_idx++;
}
}
return $tables;
}
sub max # returns the larger of two numeric values
{
my ($x1, $x2) = @_;
return $x1 if ($x1 > $x2);
return $x2;
}
Obviously, the previous script is not very practical. However, it could be modified into something very useful.
With a few changes you can create an automated personal newsletter. Instead of asking for three URLs, the script could be modified to watch one particular site. After the first run, the checksums generated by the pageParse subroutine could be saved to disk. The next time the script is run, the new pageParse result could be compared to the stored one. If the content of any table cell has changed, the content of that cell could be emailed, creating an automated newsletter. A rough sketch of this idea follows.
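This is a sketch only: the URL, the checksum file name, and what you do with a changed cell are all placeholders, and it reuses the pageParse subroutine from the script above along with the core Storable module for persistence.

use Storable qw(store retrieve);

my $url  = 'http://www.example.com/news.html';  # site to watch (placeholder)
my $file = 'checksums.sto';                     # saved checksums (placeholder)
my $new  = pageParse($url, 1, 0);               # checksums for the current page
if (-e $file)
{
    my $old = retrieve($file);
    # Walk the four-dimensional structure and report any cell whose
    # checksum differs from the previous run.
    for my $depth (0 .. $#{$new})
    {
        for my $count (0 .. $#{$new->[$depth]})
        {
            for my $row (0 .. $#{$new->[$depth][$count]})
            {
                for my $col (0 .. $#{$new->[$depth][$count][$row]})
                {
                    my $o = $old->[$depth][$count][$row][$col];
                    my $n = $new->[$depth][$count][$row][$col];
                    if (!defined $o or $o ne $n)
                    {
                        # A real newsletter would re-fetch this cell with
                        # keep_html turned on and email it; here we only
                        # report the changed coordinates.
                        print "Changed cell: $depth $count $row $col\n";
                    }
                }
            }
        }
    }
}
store($new, $file); # remember the current state for the next run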