Web Mining with Perl - Cut Along The Table Lines (HTML::TableExtract)
HTML tables not only segregate data visually on a web page; they also provide
helpful landmarks when parsing it. Tables are used to align information on web
pages: they can force information into one location or make it take up a
certain width of the screen.
Tables become even more important on dynamic, data-driven web sites. On most
such sites, content such as articles is stored separately from the page's
visual aspects, and when the HTML pages are generated the content is separated
from the other features of the page with a table. In other words, the content
of the main page might change, but the layout defined by the tables rarely
does. This matters because when processing a web page the developer will often
want to ignore the static or template data and access only the dynamic data.
The developer of a web crawler will want to identify which tables, rows, and
cells hold the data of interest and pull the information from there.
Fortunately there is a Perl module, HTML::TableExtract, designed to parse HTML
tables. The following example script shows how a particular table can be
parsed out of an HTML page.
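#!/usr/bin/perl
use strict;
use warnings;
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;
use Data::Dumper;

# Select the first table (count 0) nested three levels deep (depth 3).
my $te = new HTML::TableExtract( depth=>3, count=>0, gridmap=>0);

# get() returns undef when the page cannot be fetched.
my $content = get("http://www.computerjobs.com");
die "Could not fetch the page\n" unless defined $content;
$te->parse($content);

# Walk every matched table, then every row inside it.
foreach my $ts ($te->table_states)
{
    foreach my $row ($ts->rows)
    {
        print Dumper $row;
        # print Dumper $row if (scalar(@$row) == 2);
    }
}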
Now to explain the highlights of the code.
my $te = new HTML::TableExtract( depth=>3, count=>0, gridmap=>0);
This is where we create and initialize the TableExtract object. We pass three
parameters to the constructor.
depth => 3 - the depth of the table we want to work with. This says that the
table is inside a table (depth => 2), which is inside another table
(depth => 1), which is in turn inside the outermost table (depth => 0).
count => 0 - more than one table can exist at depth 3. count => 0 says that we
want the first table found at that depth.
gridmap => 0 - represents the table as a tree instead of a grid.
Together, depth and count uniquely identify any table in an HTML page. Note
that the table identified by (depth=>3, count=>1) is not necessarily a
neighbor of the (depth=>3, count=>0) table. For instance:
<table><tr><td> <!-- Table depth=>0 count=>0 -->
  <table><tr><td> <!-- Table depth=>1 count=>0 -->
    <table><tr><td>
      <!-- Table depth=>2 count=>0 -->
    </td></tr></table>
  </td></tr></table>
  <table><tr><td> <!-- Table depth=>1 count=>1 -->
    <table><tr><td>
      <!-- Table depth=>2 count=>1 -->
    </td></tr></table>
    <table><tr><td>
      <!-- Table depth=>2 count=>2 -->
    </td></tr></table>
  </td></tr></table>
</td></tr></table>
In the example shown above there are three tables at depth 2. Notice that the
tables (depth=>2 count=>0) and (depth=>2 count=>1) do not share the same
parent table. The count does not reset to zero when the HTML backs out of a
depth and enters it again. The table identified as (depth=>2 count=>1) is
literally the second table (count = 1) at the third depth (depth = 2), since
both numbers start at zero.
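To check the numbering for ourselves, here is a minimal sketch that parses a nested layout like the one above and lists the coordinates of every table found. It assumes that a TableExtract object created with no depth or count constraints matches every table, and that each table state object offers a coords method returning its depth and count; the cell text (A, B, C) is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Same nesting as the example above; the cell text is invented.
my $html = <<'HTML';
<table><tr><td>
  <table><tr><td>
    <table><tr><td>A</td></tr></table>
  </td></tr></table>
  <table><tr><td>
    <table><tr><td>B</td></tr></table>
    <table><tr><td>C</td></tr></table>
  </td></tr></table>
</td></tr></table>
HTML

# Assumption: with no constraints given, every table in the page matches.
my $te = HTML::TableExtract->new();
$te->parse($html);

# Collect "depth,count" for each table that was found.
my @coords;
foreach my $ts ($te->table_states) {
    push @coords, join(',', $ts->coords);
}

print "$_\n" for sort @coords;
```

If the numbering works as described, the list should contain three entries at depth 2 (2,0 then 2,1 then 2,2) even though the first of them sits under a different parent than the other two.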
The gridmap option controls whether the data is represented logically as a
grid or as a tree. Consider the following example.
<table>
<tr>
<td> location [1,1] </td>
<td> location [1,2] </td>
</tr>
<tr>
<td colspan=2> location [2,1] </td>
</tr>
</table>
If gridmap=1 (the default), the cell [2,2] will exist but be empty, because
gridmap=1 forces the table to look like a grid. If gridmap=0, the table is
represented as a tree in which each row can have a different number of cells,
and position [2,2] will not be defined.
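The difference is easiest to see by counting the cells in each row under both settings. The sketch below is only an illustration: row_widths is a hypothetical helper written for this example, not part of the module, and the table is a copy of the one above with the colspan placed on the cell.

```perl
use strict;
use warnings;
use HTML::TableExtract;

my $html = <<'HTML';
<table>
<tr><td>location [1,1]</td><td>location [1,2]</td></tr>
<tr><td colspan="2">location [2,1]</td></tr>
</table>
HTML

# Hypothetical helper: parse the table with the given gridmap setting
# and return the number of cells found in each row.
sub row_widths {
    my ($gridmap) = @_;
    my $te = HTML::TableExtract->new(
        depth => 0, count => 0, gridmap => $gridmap,
    );
    $te->parse($html);
    my ($ts) = $te->table_states;
    return map { scalar @$_ } $ts->rows;
}

print "gridmap=1: ", join(' ', row_widths(1)), "\n";
print "gridmap=0: ", join(' ', row_widths(0)), "\n";
```

Following the description above, the second row should report two cells with gridmap=1 and a single cell with gridmap=0.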
After the table is identified, the object representing the table can be
accessed. The methods for this include table_states and table_state.
table_state takes a depth and a count as an identifier and returns the
matching table state object. table_states returns an array of table state
objects, which is what our code uses.
A TableExtract object can represent multiple tables. This is accomplished by
specifying only depth or only count (not both); the object then represents
every table that matches.
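A minimal sketch of both ideas follows: only depth is specified, so several tables match, and one of them is then picked out with table_state. The HTML and its cell text are invented for illustration. Note that newer releases of the module document these methods under the names tables and table and keep the older names used in this article as aliases, so this sketch assumes the older names are still available.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented page: one outer table holding two inner tables.
my $html = <<'HTML';
<table><tr><td>
  <table><tr><td>first inner</td></tr></table>
  <table><tr><td>second inner</td></tr></table>
</td></tr></table>
HTML

# Only depth is given, so every table at depth 1 matches.
my $te = HTML::TableExtract->new(depth => 1);
$te->parse($html);

my @matched = $te->table_states;
print scalar(@matched), " tables matched at depth 1\n";

# table_state(depth, count) picks one matched table by its coordinates.
my $second = $te->table_state(1, 1);
my ($row)  = $second->rows;
print "first cell of (1,1): $row->[0]\n";
```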
In the first foreach loop we go through the list of tables, which is obtained
with the table_states method. The inner loop goes through the rows inside each
table (each row corresponds to a tr tag). The result of the rows method is an
array of arrays that represents the two-dimensional table.
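Because each row is simply a reference to an array of cell values, ordinary Perl array operations apply to it. The sketch below keeps only the two-cell rows, in the spirit of the commented-out line in the script; the job listing table is made up for illustration and stands in for a fetched page.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented listing: a one-cell heading row followed by two data rows.
my $html = <<'HTML';
<table>
<tr><td>Open positions</td></tr>
<tr><td>Perl Developer</td><td>Atlanta</td></tr>
<tr><td>DBA</td><td>Denver</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new(depth => 0, count => 0, gridmap => 0);
$te->parse($html);

my @pairs;
foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {
        # Each $row is an array reference holding that row's cell text.
        next unless @$row == 2;    # skip the one-cell heading row
        push @pairs, join(' | ', @$row);
    }
}

print "$_\n" for @pairs;
```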