
Learning From Links (HTML::LinkExtor) - Perl

It is common knowledge that the Internet is a great data source. It is also common knowledge that it is difficult to get the information you want in the format you need. No longer.

TABLE OF CONTENTS:
  1. Web Mining with Perl
  2. Accessing The Net (LWP)
  3. Cut Along The Table Lines (HTML::TableExtract)
  4. Learning From Links (HTML::LinkExtor)
  5. Checking For Sameness (String::CRC)
  6. Bringing It All Together
  7. Conclusion
By: Tommie Jones
March 05, 2002

A lot can be learned from what web sites link to. Contributors to web-based bulletin boards supply links to the subjects of their discussions. Articles that mention products provide links to where those products can be ordered. Google's page ranking and grouping algorithms estimate how important a web page is from how many sites link to it. As another example, an organization marketing a product could write a web crawler that searches the web for any mention of its own or its competitors' products.

A useful module to extract links out of a web page is HTML::LinkExtor. Here is an example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::LinkExtor;
    use Data::Dumper;

    my $content = get("http://www.yahoo.com");   # get web page into $content
    die "get failed" if (!defined $content);

    my $parser = HTML::LinkExtor->new();         # create LinkExtor object with no callbacks
    $parser->parse($content);                    # parse content
    $parser->eof;                                # tell the parser the document is complete

    my @links = $parser->links;                  # get list of links
    print Dumper \@links;                        # print list of links out

This script fetches and parses the page named in the get call. After the content is parsed, the links method returns a list that is stored in the links array (@links). Each element of @links is a reference to a sub array that represents one link-bearing tag. The first element of the sub array is the tag name. The remaining elements are attribute name/value pairs holding the links found in that tag. The tags processed are not only 'a' tags but also 'img' tags, 'form' tags, and any other tag with an attribute that holds a URL, such as 'href', 'src' or 'action'.
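
To make the shape of the links array concrete, here is a minimal sketch that walks each sub array and splits it into the tag name and its link attributes. The HTML snippet and the @found array are assumptions made up for this illustration (they are not part of the original script), and the page is parsed from a string so the sketch does not depend on a live site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page, used here instead of a live fetch.
my $html = <<'END_HTML';
<html><body>
<a href="http://www.example.com/">Example</a>
<img src="/images/logo.png" alt="logo">
<form action="/search" method="get"></form>
</body></html>
END_HTML

my $parser = HTML::LinkExtor->new();   # no callback: links are stored in the object
$parser->parse($html);
$parser->eof;                          # signal the end of the document

my @found;                             # "tag attribute url" strings, in document order
for my $link ($parser->links) {
    # First element is the tag name; the rest are attribute name/value pairs.
    my ($tag, %attrs) = @$link;
    for my $name (sort keys %attrs) {
        push @found, "$tag $name $attrs{$name}";
    }
}

print "$_\n" for @found;
```

Only attributes that carry a link appear in the pairs, so the 'alt' and 'method' attributes in the sample are not reported.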

Another way to use HTML::LinkExtor is to pass a callback function to the new method. While the content is parsed, the callback is called for every tag that contains a link. Please review the perldoc for HTML::LinkExtor for more information.

Generic HTML Parsing (HTML::Parser)

The two previously mentioned modules, HTML::TableExtract and HTML::LinkExtor, inherit from HTML::Parser. HTML::Parser works in a manner similar to the SAX interface for XML: it is an event-driven parser designed to work with HTML. Recognized events include start tags, end tags, text and comments. For each event handler you can define an argspec, which is a list of the information that will be passed to the handler.
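
The callback mode of HTML::LinkExtor mentioned above is itself a small instance of this event-driven style. The following is a rough sketch of it, assuming a made-up HTML snippet and a hypothetical record_link subroutine. The callback receives the tag name followed by the link attribute name/value pairs, and here it simply counts link-bearing tags and collects their URLs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page, used here instead of a live fetch.
my $html = <<'END_HTML';
<a href="http://www.example.com/">Home</a>
<a href="/about.html">About</a>
<img src="/images/logo.png" alt="logo">
<p>No link in this tag.</p>
END_HTML

my %count;   # tag name => number of link-bearing tags seen
my @urls;    # every URL found, in document order

# Called once for each tag that contains at least one link attribute.
sub record_link {
    my ($tag, %attrs) = @_;
    $count{$tag}++;
    push @urls, $attrs{$_} for sort keys %attrs;
}

my $parser = HTML::LinkExtor->new(\&record_link);
$parser->parse($html);
$parser->eof;

printf "%-4s %d\n", $_, $count{$_} for sort keys %count;
print "$_\n" for @urls;
```

When a callback is supplied, the links are handed to it as they are found instead of being stored, so there is no need to call the links method afterwards.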

Here is an example to work through.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Parser;
    use Data::Dumper;

    my $url = shift @ARGV;
    die "No URL specified on command line." unless (defined $url);

    my $content = get($url);    # put site html in $content
    die "get failed" if (!defined $content);

    # Create the parser object: one handler and one argspec per event.
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ \&startTag, 'tag, attr' ],
        end_h       => [ \&endTag,   'tag' ],
        text_h      => [ \&textElem, 'text' ],
    );

    # Parse the content; eof tells the parser the document is complete.
    $parser->parse($content);
    $parser->eof;

    sub startTag {
        my ($tag, $attrHash) = @_;
        print "TAG: $tag \n";
        # Parenthesize Dumper's argument so the trailing "\n" is not dumped too.
        print "ATTR HASH: ", Dumper($attrHash), "\n";
        print "-----\n";
    }

    sub endTag {
        my $tag = shift;
        print "END TAG: $tag \n";
        print "-----\n";
    }

    sub textElem {
        my $text = shift;
        print "TEXT: $text \n";
        print "-----\n";
    }

Note that in the above code the event handlers are defined in the HTML::Parser constructor. Each event is named by a parameter ending in _h and is given a reference to a subroutine plus a string, which is the argspec. Whenever the event occurs, the referenced subroutine is called with the values named in the argspec as its parameters. For example, whenever a start tag occurs (a tag that looks like '<...>'), the startTag subroutine is called with the 'tag' scalar and the 'attr' hash reference. Possible argspecs include: self, tokens, tokenpos, token0, tagname, tag, attr, attrseq, @attr, text, dtext, is_cdata, offset, event and many others. Several of these present the same data in a slightly different format. The perldoc for HTML::Parser provides more information.
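
As an illustration of how the choice of argspec changes what a handler receives, here is a minimal sketch that collects the text of every 'h1' element. It uses 'tagname' (the tag name without the leading slash on end tags) and 'dtext' (text with entities decoded) in place of 'tag' and 'text'. The sample HTML and the variable names are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample document, parsed from a string rather than fetched.
my $html = '<h1>Web Mining &amp; Perl</h1><p>Intro text</p><h1>Links</h1>';

my @headings;   # text of each completed <h1>
my $current;    # undef outside <h1>; accumulates text while inside one

my $parser = HTML::Parser->new(
    api_version => 3,
    # 'tagname' passes just the element name, for start and end events alike.
    start_h => [ sub { my ($tagname) = @_;
                       $current = '' if $tagname eq 'h1' }, 'tagname' ],
    # 'dtext' passes the text with entities such as &amp; already decoded.
    text_h  => [ sub { my ($dtext) = @_;
                       $current .= $dtext if defined $current }, 'dtext' ],
    end_h   => [ sub { my ($tagname) = @_;
                       if ($tagname eq 'h1' && defined $current) {
                           push @headings, $current;
                           undef $current;
                       } }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @headings;
```

The text handler appends rather than assigns because the parser may deliver the text of one element in more than one event.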

HTML::Parser is an excellent module and is very flexible.

 
 
 
