Web Mining with Perl - Accessing The Net (LWP)
LWP, which stands for the
libwww-perl library, is a common set of modules that comes with many
installations of Perl. LWP (as quoted from the LWP perldoc) is a collection of
Perl modules that provide a consistent and simple application-programming
interface to the World Wide Web. LWP provides support for redirection, cookies,
basic authentication and robots.txt parsing. For the majority of web-crawling
requirements a developer can use LWP::Simple. LWP::Simple allows the developer to
fetch the headers of a web page (given its URL), or to store its body in a
scalar variable or file. Here is an example.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
# Store the output of the web page (HTML and all) in $content
my $content = get("http://www.yahoo.com");
if (defined $content)
{
    # $content contains the HTML associated with the URL mentioned above.
    print $content;
}
else
{
    # If an error occurs, get returns undef and $content is not defined.
    print "Error: Get failed\n";
}
After the LWP::Simple module is loaded with the use statement, the
get subroutine is called to download the HTML of the page at
http://www.yahoo.com. The HTML is stored in the $content variable. If no error
occurs, the value of $content is printed to standard output; otherwise get
returns undef and an error message is printed.
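LWP::Simple can also write a page straight to a file or fetch only its headers. The following is a minimal sketch rather than a definitive implementation: the helper names save_page and page_type are hypothetical and chosen only for illustration, while getstore, head and is_success are functions exported by LWP::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore head is_success);

# Hypothetical helper: save the body found at $url into the file $file.
# getstore returns the HTTP status code; is_success is true for 2xx codes.
sub save_page {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);
    return is_success($status);
}

# Hypothetical helper: return the Content-Type reported for $url,
# or undef if the request fails.
sub page_type {
    my ($url) = @_;
    # In list context head returns (content type, document length,
    # modified time, expires, server), or an empty list on failure.
    my @fields = head($url);
    return @fields ? $fields[0] : undef;
}
```

For instance, save_page("http://www.yahoo.com", "yahoo.html") would return true if the page was downloaded and written to yahoo.html, and false otherwise.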
Other modules in the
libwww-perl distribution handle cookies, automatic redirection and other tasks.
For more information please read the perldoc on LWP::RobotUA and LWP::UserAgent.
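As a rough illustration of what LWP::UserAgent adds over LWP::Simple, the sketch below builds an agent that keeps cookies in memory and follows a limited number of redirects. The helper names build_agent and fetch_page, the agent string ExampleMiner/0.1, and the timeout and redirect values are assumptions made for this example, not settings taken from the article.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: build a user agent with cookies and redirects enabled.
sub build_agent {
    return LWP::UserAgent->new(
        agent        => 'ExampleMiner/0.1',  # assumed name, sent as User-Agent
        timeout      => 10,                  # seconds to wait for a response
        max_redirect => 5,                   # follow at most five redirects
        cookie_jar   => {},                  # in-memory HTTP::Cookies jar
    );
}

# Hypothetical helper: return the decoded body of $url, or undef on failure.
sub fetch_page {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    return $response->decoded_content if $response->is_success;
    # status_line gives the code and message, e.g. "404 Not Found".
    warn "Fetch of $url failed: ", $response->status_line, "\n";
    return;
}
```

A caller would create the agent once with build_agent() and reuse it for every fetch_page call, so that cookies set by one response are sent with later requests to the same site.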