The Web is a wonderful resource. It contains a wealth of information about nearly every conceivable topic, and to access much of that information you only need a Web browser, of which there are several to choose from. For example, if I need to check the weather before I head outside so that I know what to wear, I can simply navigate to the appropriate page, enter my city, and see the current conditions. Or, if I want to find information about a given movie, I need only Google it or look it up on IMDb.
This is fine when I'm going to consume the information directly and in its native format. However, what if I want to write a program that accesses this information? Say, for example, that I wanted to record the information, or transform it in some way.
This is a common task, and accessing information on the Web is actually fairly easy. In fact, there are a number of libraries that can do the job. In this article, we'll be taking a brief look at LWP, a Perl library designed for Web access.
The LWP library actually has a number of interesting and complex features, but in this article, we're only going to look at the basics—enough to design a simple Perl script that can request and receive data from the Web.
Starting out Simple
LWP makes its most basic functionality easy to reach. If all you want to do is retrieve the raw content of a document, the LWP::Simple module provides a handful of functions that do exactly that.
The simplest task is fetching the content of a document. This can be done using the appropriately named get function. It takes one argument, the URL of the document to be requested, and returns the content of that document.
So, if we wanted to get the content of Google's index page, we would only need to make one function call and then print the result to the screen. Let's go ahead and create a short script that does just that:
use strict;
use LWP::Simple;
print get('http://google.com');
The above script is pretty straightforward: it loads LWP::Simple, fetches the page with get, and passes the result straight to print.
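One detail worth knowing is that get returns undef when the request fails, so a slightly more defensive variant checks the result before printing. The following is a minimal sketch; the variable names and the wording of the error message are illustrative choices, not part of LWP.

```perl
use strict;
use warnings;
use LWP::Simple;

# Same URL as in the script above.
my $url = 'http://google.com';

# get() returns the document body, or undef if the request failed.
my $content = get($url);

if (defined $content) {
    print $content;
}
else {
    # get() does not say why the request failed, so the message is generic.
    warn "Could not fetch $url\n";
}
```

If the reason for a failure matters, the functions described next are a better fit, since they return the HTTP response code.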
Printing the content of a page isn't an uncommon task, though, and LWP::Simple provides a function that both fetches a document's content and prints it to STDOUT. The function is called getprint and, just like get, accepts one argument: the URL of the document to fetch. Swapping the print get(...) statement at the end of the previous script for a single getprint call with the same URL produces the same result.
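Written out as a complete script, the getprint version might look like the following sketch, which uses the same URL as before.

```perl
use strict;
use LWP::Simple;

# Fetch the page and print its content to STDOUT in one call.
# If the request fails, getprint reports the status on STDERR instead.
getprint('http://google.com');
```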
If we want to store the document's contents in a file, we could redirect STDOUT and then call getprint. However, LWP::Simple also provides a function called getstore, which stores the content of a URL in a given file. The first argument is the URL, and the second is the file name. To store the Google index page in a file called google.html, we pass the URL and that file name to getstore.
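As a sketch, the getstore call for this case could be written as follows. The file is created relative to the directory the script is run from.

```perl
use strict;
use LWP::Simple;

# Fetch the page and write its content to google.html.
getstore('http://google.com', 'google.html');
```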
Sometimes, though, it only makes sense to store a document if it's been updated. We can do this with the mirror function, which takes the same arguments as getstore: the URL first, then the file name.
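A minimal sketch of a mirror call, reusing the URL and file name from the getstore example, might look like this.

```perl
use strict;
use LWP::Simple;

# Download the page to google.html only if it has changed
# since the local copy was last saved.
mirror('http://google.com', 'google.html');
```

Running the script a second time shortly after the first should leave google.html untouched, provided the server reports that the page has not changed.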
The getprint, getstore and mirror functions do one additional thing: they return the HTTP response code, which in some cases is very useful. The code can be checked against constants that LWP::Simple exports, such as RC_OK. For example, below we check to see if everything went well:
use strict;
use LWP::Simple;
my $response_code = getprint('http://google.com');
print "\nOK\n" if $response_code == RC_OK;  # RC_OK is the constant for HTTP status 200
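The same kind of check works for getstore and mirror. The sketch below is an illustration rather than code from the original listing: it calls mirror and distinguishes three outcomes using RC_NOT_MODIFIED, is_success and status_message, all of which LWP::Simple re-exports from HTTP::Status. The variable names and messages are assumptions made for the example.

```perl
use strict;
use warnings;
use LWP::Simple;

my $url  = 'http://google.com';
my $file = 'google.html';

# mirror() returns the HTTP status code of the request.
my $status = mirror($url, $file);

if ($status == RC_NOT_MODIFIED) {
    # 304: the local copy is already current, so nothing was downloaded.
    print "$file is up to date\n";
}
elsif (is_success($status)) {
    # Any 2xx code: a fresh copy was written to $file.
    print "Saved $url to $file\n";
}
else {
    # Anything else: report the code and its standard reason phrase.
    warn "Request failed: $status ", status_message($status), "\n";
}
```

Using is_success rather than comparing against RC_OK alone treats every 2xx response as a success, which is usually what a script like this wants.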
As you can see, the LWP::Simple module makes common tasks very easy, as its name suggests.