To grab headlines from some popular news sites, we'll use Perl, undoubtedly the workhorse when it comes to searching through text (or HTML) files. The script is easily adaptable for use on other sites, as you'll see later on. The way we grab the headlines is to fetch the page they reside on, then parse the HTML, looking for a pattern that indicates the beginning of a headline (normally a font size or color declaration). The script then saves the text after that match and before a second match indicating the end of the headline. Probably the easiest example to start with is the headlines from Slashdot. Slashdot encourages this sort of thing (within reason) by providing a page with the essential information describing each article. You can view this file here: http://slashdot.org/slashdot.xml Here's the whole script:
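A minimal sketch of such a script, assuming LWP::Simple in place of the webget.pl call described below (the tag strings, the `<backslash>` marker, and the filenames are illustrative, not the original listing):

```perl
#!/usr/bin/perl
# Sketch of the headline grabber walked through below -- not the original script.
use strict;
use warnings;
use LWP::Simple;

my $url    = 'http://slashdot.org/slashdot.xml';
my $file   = 'headlines.html';
my $before = '<title>';      # tag just before each headline
my $after  = '</title>';     # tag just after each headline
my $webdog = '<backslash>';  # skip everything until this tag is seen
my $limit  = 5;

my $html = get($url) or die "couldn't fetch $url\n";
my @lines = split /\n/, $html;

my @headlines = ("<B>Slashdot Headlines</B><BR>\n");
my ($found, $count) = (0, 0);

foreach my $line (@lines) {
    $found = 1 if $line =~ /\Q$webdog\E/;
    next unless $found;
    next unless $line =~ /\Q$before\E/;
    $_ = $';                      # text after the opening tag
    if (/\Q$after\E/) {
        my $headline = $`;        # text before the closing tag
        push @headlines, "$headline<BR>\n";
        last if ++$count >= $limit;
    }
}

open my $out, '>', $file or die "can't write $file: $!\n";
print $out @headlines;
close $out;
```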
This script constructs a very small file with just the headlines that fits nicely into a table cell. Let's step through it a chunk at a time to better understand what is going on. Lines 3-9 define the variables we'll use to grab the headlines. $before and $after hold strings of text which appear before and after the headlines -- the text between these two matches on the same line will be grabbed as the headline. $webdog is a variable put in to make searches a little faster: the real search doesn't start until this tag is reached. Line 13 makes a system call to the webget script (with the "quiet" modifier) and puts the results in an array, @lines. I chose webget.pl because it is a single script which requires no external libraries, and is freely available. Similar results can be achieved using the LWP library and its "GET" function. At this point, the entire HTML file of the remote news page is in the @lines array, and now we can begin to manipulate it as we wish. Line 16 sets array element $headlines[0] with the title of the page, font sizes, etc. On lines 18-19, $found is a flag indicating whether the search for the headlines can begin, while $count will keep track of how many headlines have been found. Line 18 begins the loop that searches line by line through the page for headlines. Each line is checked for a match on our $webdog variable. If there is no match, the next line is checked. When a match is finally found, the $found flag is set to 1. Once the $found flag is set, the script looks for the $before text in each line (Line 21). If the text matches, we grab all the text after that match with the $' variable (also called $POSTMATCH). The two special variables used here are very useful, and allow the script to grab a string we don't necessarily have a match for (like constantly changing headlines). Here they are: $' holds the text after a successful match. $` holds the text before a successful match.
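A quick standalone illustration of those two special variables (the sample tags here are just examples):

```perl
#!/usr/bin/perl
# Demonstrate Perl's prematch ($`) and postmatch ($') variables.
use strict;
use warnings;

my $line = '<FONT SIZE=3>Big News Today</FONT>';

if ($line =~ /<FONT SIZE=3>/) {
    print "after the match:  $'\n";   # prints: Big News Today</FONT>
}
if ($line =~ /<\/FONT>/) {
    print "before the match: $`\n";   # prints: <FONT SIZE=3>Big News Today
}
```

Note that in older Perls, using $` and $' anywhere in a program imposes a small performance penalty on every regex match, which is acceptable for a short script like this one.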
Using these together allows us to grab a headline by knowing the HTML tags on either side. On Line 22 we place the text after the first match into the default variable ($_), match the tag after the headline, then strip out the headline (which is before that last match) and put it in the $headline variable. Whew! Now that the headline is in an easily handled variable, we just push it onto the @headlines array and add a <BR> for readability (Line 25). Lines 26 and 27 limit the number of headlines we grab. Once the limit (in this case 5) is reached, the script breaks out of the loop. If the limit is never reached, the loop ends after all the lines in the file have been examined. Line 30 onward is housekeeping. The open font and link tags are closed on Line 30. Line 31 is commented out for normal operation, but is very useful when debugging your match choices at the command line. After that, we print the @headlines array to the file we specified in $file at the beginning of the script. Note that this is a very simple script that takes you right to the news page. It would be fairly easy to add a separate URL for each story, or even more features. And it's easy to customize for other sites by changing the variables at the top of the script. The script writes its output to a file for two reasons: the content will be used many times, and site owners usually don't mind a query every hour or so, but it could be a bit much if your site gets a lot of traffic and the script runs on every page view. So the best thing to do with this script, and others you customize to get news from other sites, is to put them in a cron job. Once you have a few scripts, each regularly polling a site, you should have several small text files being regularly produced with their headlines. To keep everything organized, put all of your headline-grabbing scripts into a subdirectory called "news".
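For example, a crontab entry to run the grabber once an hour might look like this (the paths are hypothetical):

```
# Run the Slashdot headline grabber ten minutes past every hour
10 * * * * /usr/bin/perl /home/you/news/slashdot.pl
```

Staggering the minute field across your scripts keeps them from all hitting their sites at once.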
In this example, we will take some headlines from a couple of sites people may visit often, then make a form to collect logins and preferences, and store it all in a database. Every time a user returns, a script reads their cookie, retrieves their preferences, and builds a page showing what they want to see. We'll start out with a smattering of Perl to help us automate collecting those headlines. First, let's create the table in MySQL, in a database named "project":
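A sketch of what that table might look like (the table name, column names, and sizes here are assumptions, not the original schema):

```sql
-- Hypothetical preferences table for the "project" database.
CREATE TABLE users (
    login  VARCHAR(32) NOT NULL PRIMARY KEY,  -- user's login name
    cookie VARCHAR(64),                       -- value stored in the user's cookie
    sites  VARCHAR(255)                       -- list of sites the user wants to see
);
```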