It is common knowledge that the Internet is a great data source. It is also common knowledge that it is difficult to get the information you want in the format you need. No longer.
LWP, short for libwww-perl, is a common library that comes with many installations of Perl. As described in the LWP perldoc, it is a collection of Perl modules that provides a simple and consistent application programming interface to the World Wide Web. LWP supports redirection, cookies, basic authentication and robots.txt parsing. For the majority of web-crawling requirements a developer can use LWP::Simple. LWP::Simple allows the developer to store the head or body of a web page (given its URL) in a scalar variable or a file. Here is an example.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Store the output of the web page (HTML and all) in $content.
my $content = get("http://www.yahoo.com");
if (defined $content)
{
    # $content contains the HTML returned for the URL above.
    print $content;
}
else
{
    # get returns undef on failure, so $content is not defined here.
    print STDERR "Error: Get failed\n";
}
After the use statement loads the LWP::Simple module, the get subroutine is called to download the HTML from http://www.yahoo.com. The HTML is stored in the $content variable. If no error occurs, the value of $content is printed to standard output; otherwise get returns undef and the error message is printed instead.
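Storing a page in a file instead of a scalar can be sketched along the same lines with getstore, which LWP::Simple also exports. The helper name save_page and the command-line handling below are assumptions made for illustration, not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Hypothetical helper: download $url and write the body to $file.
# Returns true if the request succeeded, false otherwise.
sub save_page
{
    my ($url, $file) = @_;
    # getstore returns the status code of the response.
    my $status = getstore($url, $file);
    return is_success($status);
}

# Only fetch when a URL and a file name are given on the command line.
if (@ARGV == 2)
{
    my ($url, $file) = @ARGV;
    if (save_page($url, $file))
    {
        print "Saved $url to $file\n";
    }
    else
    {
        print STDERR "Error: could not save $url\n";
    }
}
```

Saved as, say, save_page.pl, it could be run as: perl save_page.pl http://www.yahoo.com yahoo.html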
Other modules in the LWP bundle handle cookies, automatic redirection and robots.txt rules. For more information, please read the perldoc for LWP::RobotUA and LWP::UserAgent.
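As a rough sketch of what LWP::UserAgent adds over LWP::Simple, the following builds an agent that keeps cookies in memory and follows a limited number of redirects. The helper names build_agent and fetch_page, the agent string, and the timeout and redirect limits are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Hypothetical helper: create a user agent with cookie and redirect handling.
sub build_agent
{
    return LWP::UserAgent->new(
        agent        => 'example-fetcher/0.1',  # illustrative agent string
        timeout      => 10,                     # seconds to wait for a response
        max_redirect => 5,                      # follow at most five redirects
        cookie_jar   => HTTP::Cookies->new,     # in-memory cookie storage
    );
}

# Hypothetical helper: return the page body, or undef if the request failed.
sub fetch_page
{
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

# Only fetch when a URL is given on the command line.
if (@ARGV)
{
    my $body = fetch_page(build_agent(), $ARGV[0]);
    if (defined $body)
    {
        print $body;
    }
    else
    {
        print STDERR "Error: Get failed\n";
    }
}
```

Unlike get in LWP::Simple, the response object returned by the agent also exposes the status line and headers, which helps when a failure needs to be diagnosed.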