Grabbing Headlines

Making your site more appealing to repeat visitors is always a priority for web developers. However, your users are likely to come to your site for different reasons. This tutorial shows how to use PHP and cookies to make your site more customizable, and therefore more likely to attract repeat visitors.

  1. Developing a User Personalization System with PHP and Cookies
  2. Grabbing Headlines
  3. User Login
  4. Reading from Cookies
  5. Conclusion
By: Duncan Lamb
September 20, 1999



To grab headlines from some popular news sites, we'll use Perl, undoubtedly the workhorse when it comes to searching through text (or HTML) files. The script is easily adaptable for use on other sites, as you'll see later on.

We will grab the headlines by fetching the page they reside on, then parsing the HTML for a pattern that indicates the beginning of a headline (normally a font size or color declaration). The script then saves the text after that match, up to a second match indicating the end of the headline.

Probably the easiest headlines to start with are Slashdot's. Slashdot encourages this sort of thing (within reason) by providing a page with the essential information describing each article. You can view this file here: http://slashdot.org/slashdot.xml

Here's the whole script:

    #!/usr/bin/perl

    $pagename="Slashdot";
    $newsurl ="www.slashdot.org/slashdot.xml";
    $homeurl="http://slashdot.org/";
    $file="slashdot.lnk";
    $before = "<title>";
    $after = "<";
    $webdog = "story"; #don't search till this is found

    #$' is post match, $` is prematch.

    @lines = `perl webget.pl -q $newsurl`;

    #First line: make it proper home page URL
    $headlines[0] = "<a href=\"$homeurl\" class=\"newstable\"><b>$pagename</b><font size=1>\n";
    $found = 0; $count = 0;
    foreach $line (@lines) {
        if ($line =~ /$webdog/i) { $found = 1;}

        if (($found) and ($line =~ /$before/i)) {
            $_ = $'; #grabs everything after match
            /$after/i;
            $headline = $`;
            push (@headlines,("<br>".$headline."\n")); #make all font changes, colors, etc. on this line
            $count++;
            last if ($count == 5);
        }
    }
    push (@headlines,"</font></a>");
    #print @headlines;
    open (FILE, ">$file") or die "Cannot write $file: $!";
    foreach $headline(@headlines){
        print FILE $headline;
    }
    close FILE;

This script produces a very small file containing just the headlines, which fits nicely into a table cell. Let's step through the script a chunk at a time to better understand what is going on.

Lines 3-9 define the variables we'll use to grab the headlines. $before and $after hold the strings of text that appear before and after each headline; the text between these two matches on the same line is grabbed as the headline. $webdog is there to make the search a little faster: the real search doesn't start until this tag is reached.

Line 13 makes a system call to the webget script (with the "quiet" modifier) and puts the results in an array, @lines. I chose webget.pl because it is a single script which requires no external libraries, and is freely available. Similar results can be achieved using the LWP library (for example, the get function in LWP::Simple). At this point, the entire HTML file of the remote news page is in the @lines array, and we can begin to manipulate it as we wish.
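As a rough sketch of the LWP alternative, assuming the LWP::Simple module is installed, the backtick call could be replaced with something like the following. The name fetch_lines and its optional $fetcher argument are made up for this illustration (the hook just lets you try the function without a network connection); get is the function LWP::Simple exports. Note that LWP expects a full URL including the http:// prefix.

```perl
use strict;
use warnings;

# Fetch a page and return it as a list of lines, much like the backtick
# call to webget.pl does. fetch_lines and $fetcher are hypothetical names;
# by default the function loads LWP::Simple and uses its get() function.
sub fetch_lines {
    my ($url, $fetcher) = @_;
    $fetcher ||= do { require LWP::Simple; \&LWP::Simple::get };

    my $content = $fetcher->($url);          # get() returns undef on failure
    die "Could not fetch $url\n" unless defined $content;

    return split /^/m, $content;             # keep the newline on each line
}

# Possible usage (needs network access and LWP::Simple):
# my @lines = fetch_lines("http://slashdot.org/slashdot.xml");
```

The rest of the script could then loop over @lines exactly as before.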

Line 16 sets array element $headlines[0] to the title of the page, font sizes, etc. On line 17, $found is a flag that says whether the search for headlines can begin, while $count keeps track of how many headlines have been found.

Line 18 begins the loop that searches line by line through the page for headlines. Each line is checked to see if there is a match for our $webdog variable. If there is no match, the next line is checked. When a match is finally found, the $found flag is changed to "1".

Once the $found flag is set, the script looks for the $before text in each line (Line 21). If the text matches, we grab all the text after that match with the $' variable (called $POSTMATCH when the English module is loaded). The two special variables used here are very useful: they allow the script to grab a string we can't match directly (like constantly changing headlines). Here they are:

$' holds the text after a successful match.
$` holds the text before a successful match.

Using these together allows us to grab a headline by knowing the HTML tags on either side. On Line 22 we place the text after the first match into the default variable ($_). We then match the tag after the headline, take the text before that last match (which is the headline itself), and put it in the $headline variable. Whew!
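To see the two variables working together in isolation, here is a minimal sketch. The function name between and the sample line are invented for this illustration. Unlike the script, it also checks that the second match succeeded, because after a failed match Perl leaves $` and $' holding the values from the last successful one.

```perl
use strict;
use warnings;

# Return the text found between $before and $after on one line,
# or nothing if either marker is missing. "between" is a hypothetical name.
sub between {
    my ($line, $before, $after) = @_;

    return unless $line =~ /\Q$before\E/i;   # find the opening marker
    my $rest = $';                           # everything after that match

    return unless $rest =~ /\Q$after\E/i;    # find the closing marker
    return $`;                               # everything before that match
}

print between("<title>Example headline</title>", "<title>", "<"), "\n";
# prints "Example headline"
```

The \Q...\E pair quotes the markers so that characters such as "." or "?" in a marker are matched literally rather than as regular expression syntax.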

Now that the headline is in an easily handled variable, we just push it onto the @headlines array and add a <BR> for readability (Line 25).

Lines 26 and 27 limit the number of headlines we grab. Once the limit is reached (in this case 5), the script breaks out of the loop. If the limit is never reached, the loop ends after all the lines in the file have been examined.

Lines 30 onward are housekeeping. The open font and link tags are closed on line 30. Line 31 is commented out for normal operation, but is very useful when debugging your match choices at the command line. After that, we print the @headlines array to the file we specified in $file at the beginning of the script.
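If the headline file is read by a web page while the script is rewriting it, a visitor could briefly see a half-written file. One way around that, sketched here as a suggestion rather than as part of the original script, is to write to a temporary file and rename it into place. The function name write_headlines and the ".tmp" suffix are arbitrary choices for this illustration.

```perl
use strict;
use warnings;

# Write the headlines to "$file.tmp", then rename it over $file so readers
# see either the old file or the new one. Names here are hypothetical.
sub write_headlines {
    my ($file, @headlines) = @_;
    my $tmp = "$file.tmp";

    open(my $out, '>', $tmp) or die "Cannot write $tmp: $!";
    print {$out} @headlines;
    close($out) or die "Cannot close $tmp: $!";

    rename($tmp, $file) or die "Cannot rename $tmp to $file: $!";
    return;
}

# Possible usage at the end of the grabber script:
# write_headlines($file, @headlines);
```

The three-argument form of open with a lexical filehandle also avoids surprises if $file ever contains unusual characters.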

Note that this is a very simple script: the whole block of headlines links straight to the news site's home page. It would be fairly easy to give each story its own URL, or to add even more features. And it's easy to customize for other sites by changing the variables at the top of the script.
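As a sketch of the per-story link idea, suppose the feed puts each story's <title> and <url> elements on their own lines, with the title first. That layout is an assumption made for this example, so check it against the real file before relying on it. The function name story_links is also invented here.

```perl
use strict;
use warnings;

# Build one link per story, up to $limit stories.
# Assumed input (hypothetical layout): a <title>...</title> line followed
# later by a <url>...</url> line for the same story.
sub story_links {
    my ($limit, @lines) = @_;
    my (@links, $title);

    foreach my $line (@lines) {
        if ($line =~ m{<title>(.*?)</title>}i) {
            $title = $1;                      # remember the latest title
        }
        elsif (defined $title and $line =~ m{<url>(.*?)</url>}i) {
            push @links, qq{<br><a href="$1">$title</a>\n};
            undef $title;                     # wait for the next story
            last if @links == $limit;
        }
    }
    return @links;
}

# Possible usage in place of the main loop:
# push @headlines, story_links(5, @lines);
```

Because each headline would then carry its own link, the single link wrapped around the whole block on line 16 would need to be closed right after the page name.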

The script writes its output to a file for two reasons. First, the content will be used many times. Second, site owners usually don't mind a query every hour or so, but it could be a bit much if your site gets a lot of traffic and the script runs on every page view. So the best thing to do with this script, and with others you customize to get news from other sites, is to run them from a cron job. After you have a few scripts, each regularly polling a site, you should have several small text files of headlines being refreshed on a schedule. To keep everything organized, put all of your headline-grabbing scripts into a subdirectory called "news".
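As an extra safeguard alongside cron (this is an addition, not something the original script does), a grabber could refuse to hit the remote site when its output file is still fresh. A minimal sketch, with the function name is_stale and the one-hour threshold chosen purely for illustration:

```perl
use strict;
use warnings;

# True if $file is missing or older than $max_age_hours.
# -M gives the file's age in days, measured from when the script started.
sub is_stale {
    my ($file, $max_age_hours) = @_;

    return 1 unless -e $file;                # no file yet: must fetch
    my $age_days = -M $file;
    return $age_days > $max_age_hours / 24;
}

# Possible usage near the top of a grabber script:
# exit 0 unless is_stale("slashdot.lnk", 1);
```

With a check like this, running a script by hand while testing won't send extra requests to the news site.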

In this example, we will take headlines from a couple of sites people may visit often, then make a form to collect logins and preferences, and store it all in a database. Every time a user returns, a script reads their cookie, retrieves their preferences, and builds a page showing what they want to see. We'll start out with a smattering of Perl to help us automate collecting those headlines.

First, let's create the table in MySQL, in a database named "project":

    CREATE TABLE users (
        login char(16) NOT NULL,
        password char(10) NOT NULL,
        lastlogin date DEFAULT '0000-00-00' NOT NULL,
        news1 char(20),
        news2 char(20),
        news3 char(20),
        PRIMARY KEY (login)
    );
