Web Mining with Perl
By: Tommie Jones
    2002-03-05

    Table of Contents:
  • Web Mining with Perl
  • Accessing The Net (LWP)
  • Cut Along The Table Lines (HTML::TableExtract)
  • Learning From Links (HTML::LinkExtor)
  • Checking For Sameness (String::CRC)
  • Bringing It All Together
  • Conclusion


    Web Mining with Perl - Learning From Links (HTML::LinkExtor)
    (Page 4 of 7 )

    A lot can be learned from what a web site links to. Contributors to web-based bulletin boards supply links to the subjects of their discussions. Articles that mention products provide links to where those products can be ordered. Google's page ranking and grouping algorithms estimate how important a web page is in part by how many sites link to it. As another use, an organization marketing a product might write a web crawler that searches the web for any mention of its own or its competitors' products.

    A useful module to extract links out of a web page is HTML::LinkExtor. Here is an example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::LinkExtor;
    use Data::Dumper;

    my $content = get("http://www.yahoo.com");   # fetch the web page into $content
    die "get failed" if (!defined $content);

    my $parser = HTML::LinkExtor->new();         # create LinkExtor object with no callbacks
    $parser->parse($content);                    # parse content
    $parser->eof;                                # signal end of document
    my @links = $parser->links;                  # get list of links
    print Dumper(\@links);                       # print list of links

    This script fetches and parses the web page specified in the get call. After the content is parsed, the links method returns a list that is stored in the links array (@links). Each element of @links is a reference to a sub array that represents one tag containing links. The first element of the sub array is the tag name. The remaining elements are name/value pairs for the link attributes found in that tag. The tags processed are not only 'a' tags but also 'img' and 'form' tags, among others: any tag with an attribute that holds a URL, such as 'href', 'src' or 'action'.
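
    The structure described above can be seen without a network connection. The following is a minimal sketch, not part of the original article: the sample HTML and the @summary array are made up for illustration, and the page is parsed from a string instead of being fetched with LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page, used here in place of a live fetch.
my $html = <<'HTML';
<html><body>
<a href="/about.html">About</a>
<img src="logo.png" alt="logo">
<form action="/search" method="get"></form>
</body></html>
HTML

my $parser = HTML::LinkExtor->new();   # no callback: links accumulate internally
$parser->parse($html);
$parser->eof;

# Each entry is an array reference: [ tag name, attribute => URL, ... ]
my @summary;
for my $link ($parser->links) {
    my ($tag, %attrs) = @$link;
    for my $name (sort keys %attrs) {
        push @summary, "$tag $name=$attrs{$name}";
    }
}
print "$_\n" for @summary;
```

    Only attributes that carry a URL appear in each sub array, so 'alt' and 'method' are not reported. Because no base URL is passed to new, relative URLs are returned exactly as they are written in the page.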

    Another way to use HTML::LinkExtor is to pass a callback function to the new method. While the content is parsed, the callback is called for every tag that contains a link. Please review the perldoc for HTML::LinkExtor for more information.

    Generic HTML Parsing (HTML::Parser)

    The two previously mentioned modules, HTML::TableExtract and HTML::LinkExtor, inherit from HTML::Parser. HTML::Parser works in a similar manner to the SAX interface for XML: it is an event-driven parser designed to work with HTML. Recognized events include start tags, end tags, text and comments. For each event handler you can define an argspec, which is a string listing the information that will be passed to the handler.

    Here is an example to work through.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Parser;
    use Data::Dumper;

    my $url = shift @ARGV;
    die "No URL specified on command line." unless (defined $url);

    my $content = get($url);   # put site html in $content
    die "get failed" if (!defined $content);

    # create parser object with a handler and an argspec for each event
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [\&startTag, 'tag, attr'],
        end_h       => [\&endTag,   'tag'],
        text_h      => [\&textElem, 'text'],
    );

    # parse the content, then signal end of document
    $parser->parse($content);
    $parser->eof;

    # called for every start tag
    sub startTag {
        my ($tag, $attrHash) = @_;
        print "TAG: $tag \n";
        print "ATTR HASH: ", Dumper($attrHash), "\n";   # parentheses keep "\n" out of the dump
        print "-----\n";
    }

    # called for every end tag
    sub endTag {
        my $tag = shift;
        print "END TAG: $tag \n";
        print "-----\n";
    }

    # called for every run of text
    sub textElem {
        my $text = shift;
        print "TEXT: $text \n";
        print "-----\n";
    }

    Note that in the above code the event handlers are defined in the HTML::Parser constructor. Each event is configured through a named argument ending in _h (start_h, end_h, text_h), whose value is an array reference holding a reference to a subroutine and a string, the argspec. Whenever the event occurs, the referenced subroutine is called with the values named in the argspec as its parameters. For example, whenever a start tag occurs (a tag that looks like '<...>'), the startTag subroutine is called with the 'tag' scalar and the 'attr' hash reference. Possible argspecs include: self, tokens, tokenpos, token0, tagname, tag, attr, attrseq, @attr, text, dtext, is_cdata, offset, event and many others. Many of these provide the same data in a slightly different format. The perldoc for HTML::Parser provides more information.
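
    To illustrate how the choice of argspec changes what a handler receives, here is a small sketch that uses 'tagname' and 'dtext' in place of 'tag' and 'text' to pair each link with its visible text. It is an illustration under assumptions, not code from the original article: the sample HTML, the handler names (on_start, on_text, on_end) and the @anchors array are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample markup, parsed from a string rather than fetched.
my $html = '<p>See <a href="http://example.com/">the <b>example</b> site</a> &amp; more.</p>';

my @anchors;   # collected [ href, link text ] pairs
my $current;   # the anchor currently being read, if any

my $parser = HTML::Parser->new(
    api_version => 3,
    # 'tagname' is the lower-cased element name, with no "/" prefix on end tags
    start_h => [ \&on_start, 'tagname, attr' ],
    end_h   => [ \&on_end,   'tagname' ],
    # 'dtext' is the text with entities such as &amp; decoded
    text_h  => [ \&on_text,  'dtext' ],
);
$parser->parse($html);
$parser->eof;

# Start collecting when an <a> tag with an href attribute opens.
sub on_start {
    my ($tagname, $attr) = @_;
    if ($tagname eq 'a' && defined $attr->{href}) {
        $current = { href => $attr->{href}, text => '' };
    }
}

# Text may arrive in several chunks, so append rather than assign.
sub on_text {
    my ($dtext) = @_;
    $current->{text} .= $dtext if $current;
}

# Store the finished pair when the matching </a> closes.
sub on_end {
    my ($tagname) = @_;
    return unless $tagname eq 'a' && $current;
    push @anchors, [ $current->{href}, $current->{text} ];
    undef $current;
}

print "$_->[0] => $_->[1]\n" for @anchors;
```

    Using 'tagname' lets the start and end handlers compare against the same string, whereas 'tag' would deliver '/a' to the end handler.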

    HTML::Parser is an excellent module and is very flexible.

    © 2003-2008 by Developer Shed. All rights reserved.