Web Access with LWP

There are a number of ways you can retrieve information from the web. You can access it directly via a browser, or you can write a script that gets the information for you and delivers it in a form you can use. The LWP library for Perl can help you with the latter. Keep reading for a closer look.

The Web is a wonderful resource. It contains a wealth of information about nearly every conceivable topic, and in order to access much of that information, you only need a Web browser, of which there are several to choose from. For example, if I need to check the weather before I head outside to figure out what I should wear, I can simply navigate to the appropriate page, enter my city, and I’ll be presented with the current weather conditions. Or, if I want to find information about a given movie, I need only Google it, or look it up on IMDB.

This is fine when I’m going to consume the information directly and in its native format. However, what if I want to write a program that accesses this information? Say, for example, that I wanted to record the information, or transform it in some way. 

This is a common task, and accessing information on the Web is actually fairly easy. In fact, there are a number of libraries that can do the job. In this article, we’ll be taking a brief look at LWP, a Perl library designed for Web access. 

The LWP library actually has a number of interesting and complex features, but in this article, we’re only going to look at the basics—enough to design a simple Perl script that can request and receive data from the Web.

Starting out Simple 

As it turns out, LWP provides easy access to the most basic functionality. If all you want to do is retrieve the raw content of a document, LWP makes this straightforward by providing access to a few basic functions contained in LWP::Simple. 

The simplest thing to do is probably to just get the content of a document. This can be done using the appropriately-named get function. This function takes one argument: the URL of the document to be requested. It returns the content of the document. 

So, if we wanted to get the content of Google’s index page, we would only need to make one function call and then print the result to the screen. Let’s go ahead and create a short script that does just that:

 

#!/usr/bin/perl

use strict;
use LWP::Simple;

print get('http://google.com');

 

The above script is pretty straightforward. As you can see, there’s not much to it. 
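One detail worth keeping in mind: the LWP::Simple documentation states that get returns undef when the request fails, so a script that prints the result blindly may print nothing at all. Below is a minimal sketch of a guard against that. The helper name fetch_or_default is hypothetical and is not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper (not part of LWP): return the document at $url,
# or $default when get() signals failure by returning undef.
sub fetch_or_default {
    my ($url, $default) = @_;
    my $content = get($url);
    return defined $content ? $content : $default;
}
```

A call such as print fetch_or_default('http://google.com', "request failed\n"); would then print either the page or the fallback message.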

Printing the content of a page isn’t an uncommon task, though, and LWP::Simple actually provides a function that both fetches a document’s content and prints it to STDOUT. The function is called getprint and accepts one argument, which is the URL of the document to get, just like the get function. So, we could change the previous script’s last line to this, and the result would be the same:

 

getprint('http://google.com');

 

If we want to store the document’s contents in a file, we could redirect STDOUT to that file and then call getprint. However, LWP::Simple also provides a function called getstore, which stores the content of a URL in a given file. The first argument is the URL, and the second is the file. In order to store the Google index page in a file called google.html, we’d make the following call:

 

getstore('http://google.com', 'google.html');

 

Sometimes, though, it only makes sense to store a document if it’s been updated. We can do this with the mirror function, which takes the same arguments as the getstore function:

 

mirror('http://google.com', 'google.html');

 

The getprint, getstore and mirror functions do one additional thing: they return the HTTP response code, which in some cases is very useful. The code can be checked against constants exported by the library. For example, below we check to see if everything went well:

 

my $response_code = getprint('http://google.com');
print "\nOK\n" if ($response_code == RC_OK);
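RC_OK is not the only code worth checking. As a rough sketch, the helper below classifies a status code using the HTTP::Status constants and functions that LWP::Simple re-exports (RC_NOT_MODIFIED, is_success, is_error and status_message). The helper name describe_status and its wording are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;    # also exports the HTTP::Status constants and helpers

# Hypothetical helper: turn a numeric status code, such as the one
# returned by getstore or mirror, into a short description.
sub describe_status {
    my ($code) = @_;
    return 'unchanged since the last fetch'   if $code == RC_NOT_MODIFIED;
    return 'fetched successfully'             if is_success($code);
    return 'failed: ' . status_message($code) if is_error($code);
    return "other status: $code";
}
```

It could be combined with mirror, for example: print describe_status(mirror('http://google.com', 'google.html')), "\n";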

 

As you can see, the LWP::Simple module makes common tasks very easy, as its name suggests. 

Making Requests

For simple applications, LWP::Simple provides all the functionality necessary. However, sometimes you’ll need more control over the process, and the LWP library provides the means to exercise this control. For example, say that you want to submit some form data. LWP::Simple has no function for that, so it requires a bit more work.

Now we’re going to take a look at some more advanced functionality that requires a bit of extra effort on the developer’s part. Before, when we wanted the content of a page, we only had to call one function, which requested the desired document and either returned the content or output it somewhere. However, we can break this up and perform the individual steps ourselves. The first step is preparing an HTTP request. This is done with the HTTP::Request module, and it isn’t very difficult.

Since this isn’t an article on HTTP, we’ll skip the underlying details. Let’s jump right in and see how this looks in Perl. Say we want to once again request the Google index page. We’d simply have to create an HTTP::Request object, which represents the request. The constructor takes, at a minimum, two arguments. The first is the method to use (GET and POST are the most common), and the second is the URI.

So, here’s how we would create a request for the Google index page:

 

use HTTP::Request;

my $google_request = HTTP::Request->new(GET => 'http://google.com');

 

If that’s all we want to do, though, we may be better off just using LWP::Simple. Say, however, that we wanted to post form data. For example, the National Weather Service’s (US) Web site provides a way to look up the forecast for a given city. It gets the city name from form data, through the POST method. What if we wanted to get the weather for Washington, DC? 

Let’s create another request, this time to get the weather. When creating the object, we need to pass in POST rather than GET, and we need to modify the URI. Then, we need to set the content type to an appropriate value. This is done using the content_type method. Finally, we need to specify the form values. These form values are stored in the body of the HTTP request and are set using the content method. 

Here’s what the result looks like:

 

my $request = HTTP::Request->new(POST => 'http://forecast.weather.gov/zipcity.php');
$request->content_type('application/x-www-form-urlencoded');
$request->content('inputstring=Washington,DC');
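When the body is set by hand with the content method, any special characters in the form values have to be URL-encoded by hand as well. One possible alternative, sketched here, is the POST function from HTTP::Request::Common, which builds an HTTP::Request from a list of form fields and sets the content type itself. The as_string call only prints the request so it can be inspected; nothing is sent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build the same kind of POST request from a list of form fields.
# POST() sets the Content-Type header and URL-encodes the values.
my $request = POST(
    'http://forecast.weather.gov/zipcity.php',
    [ inputstring => 'Washington,DC' ],
);

# Dump the request as text to see what would be sent.
# Nothing is transmitted at this point.
print $request->as_string;
```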

 

Making it Work

Now that we have a request object made, we need to actually make the request to the server. This is done using the LWP::UserAgent module. The user agent puts everything together and makes it all work. It’s the thing that actually communicates with the target Web server. 

To make the request, we first need to create an LWP::UserAgent object. Then we call the user agent’s request method, passing the request object as an argument. The request method returns a response, which we’ll get to shortly. Let’s return to the Google example here and actually retrieve the index page. Here’s how this is done:

 

use HTTP::Request;
use LWP::UserAgent;

# Make the request object
my $request = HTTP::Request->new(GET => 'http://google.com');

# Create the user agent and make the actual request
my $ua = LWP::UserAgent->new;
my $response = $ua->request($request);

 

Notice how the request is the same as before. The only things we’ve added are the use LWP::UserAgent statement and two lines that work with the user agent.

Now that we have the response, how do we actually extract the content of the page? Before we do this, we’ll want to make sure that the request succeeded by calling the response’s is_success method. If it did, the content method returns the content of the response. Let’s add to our script, making it print out the content:

 

# Print out the content
print $response->content . "\n" if ($response->is_success);
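As written, the script prints nothing when the request fails. A small sketch of one way to report the failure instead uses the response’s status_line method, which returns the status code and message as one string. The helper name content_or_error is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: return the body of a successful response, or a
# short error string built from status_line (code plus message).
sub content_or_error {
    my ($response) = @_;
    return $response->is_success
        ? $response->content
        : 'Request failed: ' . $response->status_line;
}
```

In the script above, print content_or_error($response), "\n"; could then replace the conditional print.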

 

When you run the script, you should see the source of the Google index page printed out to the screen. 

The script certainly works, but it’s a real pain to create a request object only to use it one time. Fortunately, LWP::UserAgent provides some shortcuts. Instead of creating an HTTP::Request on our own, we can have LWP::UserAgent create one for us. The module provides two methods, get and post, that can do much of the work for us.

These methods can take a number of arguments, but since we don’t have space to cover all of them in this article, we’ll just take a look at the basic functionality provided by them. The first argument of both methods, and the only required one, is the URL. This is the only argument we need to pass in order to rewrite the Google index page script. Let’s do that now:

 

my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://google.com');
print $response->content . "\n" if ($response->is_success);

 

As you can see, this script is much shorter than the last one, and it’s easier to read. Notice how the get method returns a response object, just as the request method does. The post method operates the same way. 

Getting the Weather

Earlier, I talked about creating a script that would retrieve the weather. However, the script was left unfinished. That was fine then, since I only wanted to explain how to form a POST request, but let’s wrap up by re-examining the weather example in order to create something functional. 

The first thing we should do is rewrite the script so that it uses one of the shortcut methods of LWP::UserAgent that we just examined. Since we need to make a POST request, we’ll use the post method. 

We’ll need to pass in form data, though, and thankfully, the post method makes this really easy. All we have to do is pass a reference to a hash of key/value pairs for the form. This will be passed as the second argument to the method.

Here’s what that looks like:

 

#!/usr/bin/perl

use strict;
use LWP::UserAgent;

# Create the user agent
my $ua = LWP::UserAgent->new;

# Post the form data
my $res = $ua->post('http://forecast.weather.gov/zipcity.php',
    {'inputstring' => 'Washington,DC'});

 

Again, the post method returns a response object. You might think that we should now examine the content of the response object to search for whatever information we need. However, it turns out that the server returns a 302 status code, indicating that the information we want is found elsewhere. The National Weather Service Web site actually finds a geographical point that matches the input string, and then redirects us to the proper page for that point. 

In order to actually view the weather, then, we need to follow the redirect to the proper location. This location is stored in the response headers in a field called “Location” and can be easily extracted using the header method. Here’s how we extract the new URL:

 

# Extract the new location
my $location = $res->header('Location');
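The manual step is needed because, by default, LWP::UserAgent follows redirects automatically only for GET and HEAD requests, not for POST. The sketch below shows two possible refinements: checking is_redirect before trusting the Location header, and adding POST to the agent’s requests_redirectable list so that the agent follows the redirect itself. The helper name redirect_target is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: return the redirect target of a response,
# or undef when the response is not a redirect.
sub redirect_target {
    my ($res) = @_;
    return $res->is_redirect ? $res->header('Location') : undef;
}

# Alternative: let the user agent follow redirects for POST requests too.
my $ua = LWP::UserAgent->new;
push @{ $ua->requests_redirectable }, 'POST';
```

Which approach fits better depends on whether the script needs to see the intermediate response; the explicit check keeps each step visible.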

 

Now we just need to request the new page. After we get the content, we can use regular expressions to extract the current temperature. Here are the final few lines of the script:

 

# Redirect to the new location
$res = $ua->get($location);

# Get the weather
if ($res->content =~ />(.*?)<br><br>(\d+) &d/) {
    print "$1\n$2\n";
}
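To see what the pattern captures without touching the network, it can be tried against a small string. The HTML fragment below is invented for illustration and is only assumed to resemble the forecast page; the real markup may differ, and a pattern like this will break whenever the page layout changes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fragment standing in for the forecast page's markup.
my $html = '<p class="current">Fair<br><br>72 &deg;F</p>';

# Same pattern as in the script: capture the text before the two line
# breaks (the conditions) and the digits that follow (the temperature).
my ($conditions, $temperature) = $html =~ />(.*?)<br><br>(\d+) &d/;

print "$conditions\n$temperature\n" if defined $conditions;
```

With this sample input, the first capture is the word before the line breaks and the second is the number before the degree entity.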

 

Of course, there is a lot more to LWP, far too much to be summarized in a single article. This article has covered only the basics, but you should now be able to create simple applications that can access the Web.
