Web Mining with Perl
By: Tommie Jones
    2002-03-05

    Table of Contents:
  • Web Mining with Perl
  • Accessing The Net (LWP)
  • Cut Along The Table Lines (HTML::TableExtract)
  • Learning From Links (HTML::LinkExtor)
  • Checking For Sameness (String::CRC)
  • Bringing It All Together
  • Conclusion


    Web Mining with Perl - Cut Along The Table Lines (HTML::TableExtract)
    (Page 3 of 7 )

    HTML tables not only help visually separate data on a web page, they also provide helpful landmarks when parsing it. Tables are used to align information: they can pin content to one location or make it take up a certain width of the screen.

    Tables become even more important on dynamic, data-driven web sites. On most such sites, content such as articles is stored separately from the page's visual aspects, and when the HTML pages are generated the content is set apart from the other features of the page with a table. In other words, the content of the main page might change, but the layout defined by the tables rarely does. This matters because, when processing a web page, the developer will often want to ignore the static or template data and access only the dynamic data. The developer of a web crawler will want to identify which tables, rows, and cells hold the data of interest and pull the information from there.

    Fortunately, the Perl module HTML::TableExtract is designed to parse HTML tables. The following example script shows how a particular table can be parsed out of an HTML page.

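    #!/usr/bin/perl
    use strict;
    use warnings;
    use lib qw( .. );
    use HTML::TableExtract;
    use LWP::Simple;
    use Data::Dumper;

    # Select the first table (count 0) nested three levels deep (depth 3)
    my $te = HTML::TableExtract->new( depth => 3, count => 0, gridmap => 0 );

    # Fetch the page; LWP::Simple's get returns undef on failure
    my $content = get("http://www.computerjobs.com");
    die "Could not fetch page\n" unless defined $content;
    $te->parse($content);

    # Dump every row of every matching table
    foreach my $ts ( $te->table_states ) {
        foreach my $row ( $ts->rows ) {
            print Dumper $row;
            # print Dumper $row if (scalar(@$row) == 2);
        }
    }

    Now to explain the highlights of the code.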

    my $te = HTML::TableExtract->new( depth => 3, count => 0, gridmap => 0 );

    This is where we create and initialize the TableExtract object. We pass three parameters to the constructor:

    depth => 3 - the depth of the table we want to work with. It says that this table is inside a table (depth 2), which is inside another table (depth 1), which is in turn inside a top-level table (depth 0).
    count => 0 - more than one table can exist at depth 3. count => 0 selects the first table found at that depth.
    gridmap => 0 - represents each table as a tree instead of a grid.

    The combination of depth and count uniquely identifies any table in an HTML page. Note that the table identified by (depth=>3, count=>1) is not necessarily a neighbor of the (depth=>3, count=>0) table. For instance:

    <table><tr><td>                <!-- Table depth=>0 count=>0 -->
      <table><tr><td>              <!-- Table depth=>1 count=>0 -->
        <table><tr><td>            <!-- Table depth=>2 count=>0 -->
        </td></tr></table>
      </td></tr></table>
      <table><tr><td>              <!-- Table depth=>1 count=>1 -->
        <table><tr><td>            <!-- Table depth=>2 count=>1 -->
        </td></tr></table>
        <table><tr><td>            <!-- Table depth=>2 count=>2 -->
        </td></tr></table>
      </td></tr></table>
    </td></tr></table>

    In the example above there are three tables at depth 2. Notice that the tables (depth=>2, count=>0) and (depth=>2, count=>1) do not share the same parent table: the count does not reset to zero when the HTML backs out of a depth. The table identified as (depth=>2, count=>1) is simply the second table (count 1) found at the third level of nesting (depth 2); both numbers start at zero.
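
    The numbering rule can be sketched in plain Perl without the module. The helper label_tables below is hypothetical (it is not part of HTML::TableExtract) and uses a naive regular expression that only looks at table tags, so treat it as an illustration of the counting scheme rather than a parser. The sample markup is the nested-table example from above with the comments removed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): label each <table>
# with the (depth, count) pair that the numbering scheme would give it.
sub label_tables {
    my ($html) = @_;
    my $depth = -1;    # becomes 0 at the first top-level <table>
    my %seen;          # tables seen so far at each depth; never reset
    my @labels;
    while ( $html =~ m{<(/?)table\b[^>]*>}gi ) {
        if ($1) {      # closing tag: step back out one level
            $depth--;
        }
        else {         # opening tag: step in and take the next count
            $depth++;
            my $count = $seen{$depth}++;
            push @labels, "depth=>$depth count=>$count";
        }
    }
    return @labels;
}

my $html = <<'HTML';
<table><tr><td>
  <table><tr><td>
    <table><tr><td></td></tr></table>
  </td></tr></table>
  <table><tr><td>
    <table><tr><td></td></tr></table>
    <table><tr><td></td></tr></table>
  </td></tr></table>
</td></tr></table>
HTML

print "$_\n" for label_tables($html);
```

    Because %seen is keyed by depth and never cleared, the second and third tables at depth 2 get counts 1 and 2 even though they sit under a different parent than the first.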

    The gridmap option tells whether to logically represent data as a grid or a tree. Consider the following example.

    <table>
      <tr>
        <td> location [1,1] </td>
        <td> location [1,2] </td>
      </tr>
      <tr>
        <td colspan=2> location [2,1] </td>
      </tr>
    </table>

    If gridmap is 1 (the default), the cell [2,2] will be defined but empty, because gridmap => 1 forces the table to look like a grid. If gridmap is 0, the table is represented as a tree in which each row can have a different number of cells, and position [2,2] is not defined.
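
    To make the difference concrete, here is a minimal sketch of the two shapes as plain Perl arrays of arrays. The data is written by hand to follow the description above; it is not output captured from the module, and cell_text is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;

# Hand-written stand-ins for the two shapes (NOT produced by the module)
my @grid = (
    [ 'location [1,1]', 'location [1,2]' ],
    [ 'location [2,1]', '' ],      # [2,2] is present but empty
);
my @tree = (
    [ 'location [1,1]', 'location [1,2]' ],
    [ 'location [2,1]' ],          # the row simply ends after one cell
);

# Hypothetical helper: look up a cell using the 1-based coordinates
# from the text (Perl arrays themselves are 0-based).
sub cell_text {
    my ( $rows, $r, $c ) = @_;
    my $cell = $rows->[ $r - 1 ][ $c - 1 ];
    return defined $cell ? "'$cell'" : '(not defined)';
}

for my $shape ( [ grid => \@grid ], [ tree => \@tree ] ) {
    my ( $name, $rows ) = @$shape;
    my $cells = scalar @{ $rows->[1] };
    print "$name: row 2 has $cells cell(s), [2,2] is ",
        cell_text( $rows, 2, 2 ), "\n";
}
```

    Code that walks a tree-shaped table should therefore check the length of each row, or test each cell with defined, before using it.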

    After a table is identified, the object representing it can be accessed through two methods, table_state and table_states. table_state takes a depth and a count and returns the table state object for that table. table_states returns a list of table state objects, one per matched table; this is the method our code uses.

    A TableExtract object can also represent multiple tables. To do this, specify only depth or only count (not both); the object then matches every table that satisfies the one constraint given.
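
    As a rough sketch of matching several tables at once, the script below gives only a depth and parses a small made-up page held in a string. It assumes a reasonably recent HTML::TableExtract and uses the tables and coords methods from the module's current documentation; as far as I can tell, tables plays the same role as the table_states call used in this article.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample page: one outer table holding two inner tables
my $html = <<'HTML';
<table><tr><td>
  <table><tr><td>first inner</td></tr></table>
  <table><tr><td>second inner</td></tr></table>
</td></tr></table>
HTML

# Only depth is given, so every table at depth 1 is matched
my $te = HTML::TableExtract->new( depth => 1 );
$te->parse($html);
$te->eof;    # tell the underlying parser the document is complete

foreach my $ts ( $te->tables ) {
    my ( $depth, $count ) = $ts->coords;
    print "table at depth=>$depth count=>$count\n";
    foreach my $row ( $ts->rows ) {
        # Cells can be undef, so substitute an empty string when printing
        print "  ", join( ' | ', map { defined $_ ? $_ : '' } @$row ), "\n";
    }
}
```

    Adding count => 0 to the constructor would narrow the match back down to a single table.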

    The first foreach loop walks the list of tables returned by the table_states method. The inner loop walks the rows of each table (one per tr tag). The rows method returns an array of arrays representing the two-dimensional table, so each $row is a reference to an array of cell values.
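
    Once the rows are in hand, ordinary list operations are enough to pick out the interesting ones. The sketch below applies the same two-cell test that appears in the commented-out line of the script to a hand-made list of rows; the values are invented for illustration and were not taken from any real page.

```perl
use strict;
use warnings;

# Stand-in for the list that rows returns: one array reference per <tr>.
# The values are invented for illustration.
my @rows = (
    ['Job listings'],
    [ 'Perl Developer', 'Atlanta' ],
    [ 'DBA',            'Dallas' ],
);

# Keep only the rows that have exactly two cells
my @pairs = grep { @$_ == 2 } @rows;

foreach my $row (@pairs) {
    my ( $title, $city ) = @$row;
    print "$title => $city\n";
}
```

    Filtering on the number of cells is a cheap way to skip heading or spacer rows when the data rows of a table all share the same width.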

    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 6 hosted by Hostway