It is common knowledge that the Internet is a great data source. It is also common knowledge that it is difficult to get the information you want in the format you need. No longer.

  1. Web Mining with Perl
  2. Accessing The Net (LWP)
  3. Cut Along The Table Lines (HTML::TableExtract)
  4. Learning From Links (HTML::LinkExtor)
  5. Checking For Sameness (String::CRC)
  6. Bringing It All Together
  7. Conclusion
By: Tommie Jones
March 05, 2002

LWP, short for the libwww-perl library, is a widely used set of modules that ships with many installations of Perl. LWP (as quoted from the LWP perldoc) is a collection of Perl modules that provide a consistent and simple application programming interface to the World Wide Web. LWP supports redirection, cookies, basic authentication and robots.txt parsing. For the majority of web-crawling requirements a developer can use LWP::Simple, which stores the head or body of a web page (given its URL) in a scalar variable or a file. Here is an example.

#!/usr/bin/perl
use LWP::Simple;

#Store the output of the web page (html and all) in content
my $content = get("http://www.yahoo.com");

if (defined $content) {
    #$content will contain the html associated with the url mentioned above.
    print $content;
}
else {
    #If an error occurs then $content will not be defined.
    print "Error: Get failed";
}
After the use statement loads the LWP::Simple module, the get subroutine is called to download the HTML of the page at http://www.yahoo.com. The HTML is stored in the $content variable. If no error occurs, the value of $content is printed to standard output. Otherwise get returns undef and an error message is printed instead.
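LWP::Simple can also write a page straight to a file or ask only for its headers. The following is a minimal sketch of that, not part of the original example: the save_page helper and the command-line handling are hypothetical names introduced here for illustration, while head, getstore and is_success are the functions LWP::Simple exports.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper: save the body of $url to $file.
# getstore returns the HTTP status code; is_success is true for 2xx codes.
sub save_page {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);
    return is_success($status);
}

# Only touch the network when a URL and an output file are given.
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;

    # In list context head() returns the content type, document length,
    # modification time, expiry time and server. The list is empty on failure.
    my ($type, $length) = head($url);
    print "Type: $type\n" if defined $type;

    print save_page($url, $file)
        ? "Saved to $file\n"
        : "Error: download failed\n";
}
```

Assuming the script is saved as save_page.pl, it could be run as: perl save_page.pl http://www.yahoo.com yahoo.html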

Other modules in the LWP bundle (Bundle::LWP) handle cookies, automatic redirection, robots.txt rules and more. For more information, read the perldoc for LWP::UserAgent and LWP::RobotUA.
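As a rough illustration of what LWP::UserAgent adds over LWP::Simple, the sketch below builds an agent with an in-memory cookie jar, a redirect limit and a timeout. The build_agent and fetch_page helpers, the agent string and the chosen limits are all assumptions made for this example. The constructor options and methods themselves come from LWP::UserAgent, HTTP::Cookies and HTTP::Response.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Hypothetical helper: build a user agent that keeps cookies in memory,
# follows at most 5 redirects and gives up after 10 seconds.
sub build_agent {
    return LWP::UserAgent->new(
        agent        => 'web-mining-example/0.1',   # made-up agent name
        timeout      => 10,
        max_redirect => 5,
        cookie_jar   => HTTP::Cookies->new,
    );
}

# Hypothetical helper: return the page body on success, undef on failure.
sub fetch_page {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    return $response->decoded_content if $response->is_success;
    warn "Error: ", $response->status_line, "\n";
    return;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $content = fetch_page(build_agent(), $ARGV[0]);
    print $content if defined $content;
}
```

Unlike LWP::Simple's get, the response object gives access to the status line, so a failed request can report why it failed rather than only returning undef.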

