Strings and Regular Expressions

Strings and regular expressions are among the basic tools that help programmers get their jobs done. This five-part article series covers how these are used in PHP. It is excerpted from chapter nine of the book Beginning PHP and Oracle: From Novice to Professional, written by W. Jason Gilmore and Bob Bryla (Apress; ISBN: 1590597702).

Programmers build applications that are based on established rules regarding the classification, parsing, storage, and display of information, whether that information consists of gourmet recipes, store sales receipts, poetry, or some other collection of data. This chapter introduces many of the PHP functions that you’ll undoubtedly use on a regular basis when performing such tasks.

This chapter covers the following topics:

  1. Regular expressions: A brief introduction to regular expressions touches upon the features and syntax of PHP’s two supported regular expression implementations: POSIX and Perl. Following that is a complete introduction to PHP’s respective function libraries. 
     
  2. String manipulation: It’s conceivable that throughout your programming career, you’ll somehow be required to modify every possible aspect of a string. Many of the powerful PHP functions that can help you to do so are introduced in this chapter. 
     
  3. The PEAR Validate_US package: In this and subsequent chapters, various PEAR packages are introduced that are relevant to the respective chapter’s subject matter. This chapter introduces Validate_US, a PEAR package that is useful for validating the syntax for items commonly used in applications of all types, including phone numbers, Social Security numbers (SSNs), ZIP codes, and state abbreviations. (If you’re not familiar with PEAR, it’s introduced in Chapter 11.)

Regular Expressions

Regular expressions provide the foundation for describing or matching data according to defined syntax rules. A regular expression is nothing more than a pattern of characters, matched against a certain parcel of text. This pattern may be one with which you are already familiar, such as the word dog, or it may be a pattern with specific meaning in the context of the world of pattern matching, <(?)>.*<\/.?>, for example.

PHP is bundled with function libraries supporting both the POSIX and Perl regular expression implementations. Each has its own unique style of syntax and is discussed accordingly in later sections. Keep in mind that innumerable tutorials have been written regarding this matter; you can find information on the Web and in various books. Therefore, this chapter provides just a basic introduction to each, leaving it to you to search out further information.

If you are not already familiar with the mechanics of regular expressions, please take some time to read through the short tutorial that makes up the remainder of this section. If you are already a regular expression pro, feel free to skip past the tutorial to the section “PHP’s Regular Expression Functions (POSIX Extended).”

Regular Expression Syntax (POSIX)

The structure of a POSIX regular expression is similar to that of a typical arithmetic expression: various elements (operators) are combined to form a more complex expression. The meaning of the combined regular expression elements is what makes them so powerful. You can locate not only literal expressions, such as a specific word or number, but also a multitude of semantically different but syntactically similar strings, such as all HTML tags in a file.


Note  POSIX stands for Portable Operating System Interface for Unix, and is representative of a set of standards originally intended for Unix-based operating systems. POSIX regular expression syntax is an attempt to standardize how regular expressions are implemented in many programming languages.


The simplest regular expression is one that matches a single character, such as g, which would match strings such as gog, haggle, and bag. You could combine several letters together to form larger expressions, such as gan, which logically would match any string containing gan: gang, organize, or Reagan, for example.

You can also test for several different expressions simultaneously by using the pipe (|) character. For example, you could test for php or zend via the regular expression php|zend.

Before getting into PHP’s POSIX-based regular expression functions, let’s review three methods that POSIX supports for locating different character sequences: brackets, quantifiers, and predefined character ranges.

Brackets

Brackets ([]) are used to represent a list, or range, of characters to be matched. For instance, contrary to the regular expression php, which will locate strings containing the explicit string php, the regular expression [php] will find any string containing the character p or h. Several commonly used character ranges follow:

  1. [0-9] matches any decimal digit from 0 through 9.
  2. [a-z] matches any character from lowercase a through lowercase z.
  3. [A-Z] matches any character from uppercase A through uppercase Z.
  4. [A-Za-z] matches any letter, whether uppercase A through Z or lowercase a through z.

Of course, the ranges shown here are general; you could also use the range [0-3] to match any decimal digit ranging from 0 through 3, or the range [b-v] to match any lowercase character ranging from b through v. In short, you can specify any ASCII range you wish.
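The bracket expressions and the pipe operator described above can be tried out with a short script. This is only a sketch: it uses preg_match() from the PCRE extension as a stand-in, because the POSIX functions covered later in this chapter were removed in PHP 7.0, and the helper name containsMatch() is made up for illustration.

```php
<?php
// Hypothetical helper: TRUE if $pattern matches anywhere in $subject.
// The pattern is wrapped in '~' delimiters because preg_match() requires them;
// the delimiters are not part of the POSIX syntax itself.
function containsMatch(string $pattern, string $subject): bool
{
    return preg_match('~' . $pattern . '~', $subject) === 1;
}

var_dump(containsMatch('[php]', 'help'));            // bool(true): contains h and p
var_dump(containsMatch('[php]', 'data'));            // bool(false): neither p nor h
var_dump(containsMatch('[0-3]', 'room 2'));          // bool(true)
var_dump(containsMatch('[b-v]', 'AZ'));              // bool(false): uppercase only
var_dump(containsMatch('php|zend', 'zend engine'));  // bool(true)
```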

Quantifiers

Sometimes you might want to create regular expressions that look for characters based on their frequency or position. For example, you might want to look for strings containing one or more instances of the letter p, strings containing at least two p’s, or even strings with the letter p as their beginning or terminating character. You can make these demands by inserting special characters into the regular expression. Here are several examples of these characters:

  1. p+ matches any string containing at least one p.
  2. p* matches any string containing zero or more p’s.
  3. p? matches any string containing zero or one p.
  4. p{2} matches any string containing a sequence of two p’s.
  5. p{2,3} matches any string containing a sequence of two or three p’s.
  6. p{2,} matches any string containing a sequence of at least two p’s.
  7. p$ matches any string with p at the end of it.

Still other flags can be inserted before and within a character sequence: 
 

  1. ^p matches any string with p at the beginning of it.
  2. [^a-zA-Z] matches any string containing at least one character outside the ranges a through z and A through Z.
  3. p.p matches any string containing p, followed by any character, in turn followed by another p.

You can also combine special characters to form more complex expressions. Consider the following examples:

  1. ^.{2}$ matches any string containing exactly two characters.
  2. <b>(.*)</b> matches any string enclosed within <b> and </b>.
  3. p(hp)* matches any string containing a p followed by zero or more instances of the sequence hp.
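To see the quantifiers and position markers listed above in action, the following sketch runs a handful of them against sample strings. It assumes a current PHP release, so preg_match() stands in for the POSIX functions; the sample strings and the $checks table are invented for this illustration.

```php
<?php
// Each row holds a pattern, a sample string, and the result we expect
// from preg_match(): 1 for a match, 0 for no match.
$checks = [
    ['p+',     'apple', 1], // at least one p
    ['p{2}',   'apple', 1], // the sequence "pp" appears
    ['p{2,}',  'ape',   0], // only a single p
    ['p$',     'stop',  1], // ends with p
    ['^p',     'stop',  0], // does not begin with p
    ['p.p',    'pop',   1], // p, any character, p
    ['^.{2}$', 'ab',    1], // exactly two characters
    ['^.{2}$', 'abc',   0], // three characters
    ['p(hp)*', 'phphp', 1], // p followed by repeated "hp"
];

foreach ($checks as [$pattern, $subject, $expected]) {
    $result = preg_match('/' . $pattern . '/', $subject);
    printf("%-8s %-6s %d\n", $pattern, $subject, $result);
}
```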

You may wish to search for these special characters in strings instead of using them in the special context just described. To do so, the characters must be escaped with a backslash (\). For example, if you want to search for a dollar amount, a plausible regular expression would be as follows: ([\$])([0-9]+); that is, a dollar sign followed by one or more digits. Notice the backslash preceding the dollar sign. Potential matches of this regular expression include $42, $560, and $3.
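As a rough illustration of escaping, the sketch below pulls dollar amounts out of a sentence. It uses preg_match_all() rather than a POSIX function, on the assumption that the script runs on a PHP version without the ereg family, and the sample sentence is invented.

```php
<?php
// Single quotes keep PHP from treating $ as the start of a variable name.
$text = 'Totals: $42, $560 and $3, but not 17.';

// \$ is a literal dollar sign; [0-9]+ is one or more digits.
preg_match_all('/\$[0-9]+/', $text, $matches);

print_r($matches[0]);   // the three dollar amounts, in order of appearance
```

The bare number 17 is skipped because it has no dollar sign in front of it.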

Predefined Character Ranges (Character Classes)

For reasons of convenience, several predefined character ranges, also known as character classes, are available. Character classes specify an entire range of characters—for example, the alphabet or an integer set. Standard classes include the following:

[:alpha:] : Lowercase and uppercase alphabetical characters. This can also be specified as [A-Za-z].

[:alnum:] : Lowercase and uppercase alphabetical characters and numerical digits. This can also be specified as [A-Za-z0-9].

[:cntrl:] : Control characters such as tab, escape, or backspace.

[:digit:] : Numerical digits 0 through 9. This can also be specified as [0-9].

[:graph:]: Printable characters found in the range of ASCII 33 to 126.

[:lower:] : Lowercase alphabetical characters. This can also be specified as [a-z].

[:punct:] : Punctuation characters, including ~`! @ # $ % ^&* ( )-_+={ } [ ] : ;'< > ,.? and /.

[:upper:] : Uppercase alphabetical characters. This can also be specified as [A-Z].

[:space:] : Whitespace characters, including the space, horizontal tab, vertical tab, new line, form feed, or carriage return.

[:xdigit:] : Hexadecimal characters. This can also be specified as [a-fA-F0-9].
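The character classes above can be explored with a small script. This is a sketch under two assumptions: it relies on preg_match(), whose PCRE engine also understands these POSIX class names, and the helper describeString() is a made-up name that exists only for this example.

```php
<?php
// Hypothetical helper: lists which POSIX classes describe the whole string.
// A class name such as [:digit:] is only valid inside a bracket expression,
// which is why the brackets are doubled in the pattern.
function describeString(string $value): array
{
    $classes = ['alpha', 'digit', 'alnum', 'upper', 'xdigit'];
    $found = [];
    foreach ($classes as $class) {
        if (preg_match('/^[[:' . $class . ':]]+$/', $value) === 1) {
            $found[] = $class;
        }
    }
    return $found;
}

print_r(describeString('PHP'));  // alpha, alnum, upper
print_r(describeString('2f'));   // alnum, xdigit
print_r(describeString('a b'));  // empty: the space fits none of these classes
```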

PHP’s Regular Expression Functions (POSIX Extended)

PHP offers seven functions for searching strings using POSIX-style regular expressions: ereg(), ereg_replace(), eregi(), eregi_replace(), split(), spliti(), and sql_regcase(). These functions are discussed in this section.

Performing a Case-Sensitive Search

The ereg() function executes a case-sensitive search of a string for a defined pattern, returning TRUE if the pattern is found, and FALSE otherwise. Its prototype follows:

boolean ereg(string pattern, string string [, array regs])

Here’s how you could use ereg() to ensure that a username consists solely of lowercase letters:

<?php
    $username = "jasoN";
    if (ereg("([^a-z])",$username))
        echo "Username must be all lowercase!";
    else
        echo "Username is all lowercase!";
?>

In this case, ereg() will return TRUE, causing the error message to be displayed.

The optional input parameter regs contains an array of all matched expressions that are grouped by parentheses in the regular expression. Making use of this array, you could segment a URL into several pieces, as shown here:

<?php
    $url = "http://www.apress.com";

    // Break $url down into three distinct pieces:
    // "http://www", "apress", and "com"
    $parts = ereg("^(http://www)\.([[:alnum:]]+)\.([[:alnum:]]+)", $url, $regs);

    echo $regs[0];    // outputs the entire string http://www.apress.com
    echo "<br />";
    echo $regs[1];    // outputs http://www
    echo "<br />";
    echo $regs[2];    // outputs "apress"
    echo "<br />";
    echo $regs[3];    // outputs "com"
?>

This returns the following:

——————————————–
http://www.apress.com
http://www
apress
com
——————————————–
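For readers on a PHP release that no longer ships ereg(), the same segmentation can be sketched with preg_match(), which fills its third argument in much the same way as regs. The choice of # as the pattern delimiter is just a convenience for this example.

```php
<?php
$url = 'http://www.apress.com';

// '#' is used as the delimiter so the slashes in the URL need no escaping.
// \. matches a literal period; each parenthesized group is captured in $regs.
$found = preg_match('#^(http://www)\.([[:alnum:]]+)\.([[:alnum:]]+)#', $url, $regs);

if ($found === 1) {
    echo $regs[0], "\n";  // http://www.apress.com
    echo $regs[1], "\n";  // http://www
    echo $regs[2], "\n";  // apress
    echo $regs[3], "\n";  // com
}
```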

Performing a Case-Insensitive Search

The eregi() function searches a string for a defined pattern in a case-insensitive fashion. Its prototype follows:

int eregi(string pattern, string string, [array regs])

This function can be useful when checking the validity of strings, such as passwords. This concept is illustrated in the following example:

<?php
    $pswd = "jasonasdf";
    if (!eregi("^[a-zA-Z0-9]{8,10}$", $pswd))
        echo "Invalid password!";
    else
        echo "Valid password!";
?>

In this example, the user must provide an alphanumeric password consisting of eight to ten characters, or else an error message is displayed.
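The same check can be sketched with preg_match() and the i modifier, assuming a PHP version in which eregi() is no longer available. The function name isValidPassword() is hypothetical.

```php
<?php
// Hypothetical helper: TRUE for 8 to 10 alphanumeric characters.
// The i modifier makes [a-z] cover uppercase letters as well.
function isValidPassword(string $password): bool
{
    return preg_match('/^[a-z0-9]{8,10}$/i', $password) === 1;
}

var_dump(isValidPassword('jasonasdf'));    // bool(true): nine characters
var_dump(isValidPassword('JASON12345'));   // bool(true): ten characters
var_dump(isValidPassword('short'));        // bool(false): too short
var_dump(isValidPassword('has space1'));   // bool(false): space is not alphanumeric
var_dump(isValidPassword('elevenchars'));  // bool(false): eleven characters
```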

Replacing Text in a Case-Sensitive Fashion

The ereg_replace() function operates much like ereg(), except that its power is extended to finding and replacing a pattern with a replacement string instead of simply locating it. Its prototype follows:

string ereg_replace(string pattern, string replacement, string string)

If no matches are found, the string will remain unchanged. Like ereg(), ereg_replace() is case sensitive. Consider an example:

<?php
    $text = "This is a link to http://www.wjgilmore.com/";
    echo ereg_replace("http://([a-zA-Z0-9./-]+)$",
                      "<a href=\"\\0\">\\0</a>",
                      $text);
?>

This returns the following:

——————————————–
This is a link to
<a href="http://www.wjgilmore.com/">http://www.wjgilmore.com/</a>

——————————————– 

A rather interesting feature of PHP’s string-replacement capability is the ability to back-reference parenthesized substrings. This works much like the optional input parameter regs in the function ereg(), except that the substrings are referenced using backslashes, such as \0, \1, \2, and so on, where \0 refers to the entire matched text, \1 to the first parenthesized substring, and so on. Up to nine back references can be used. This example shows how to replace all references to a URL with a working hyperlink:

<?php
    $url = "Apress (http://www.apress.com)";
    $url = ereg_replace("http://([a-zA-Z0-9./-]+)([a-zA-Z/]+)",
                        "<a href=\"\\0\">\\0</a>", $url);
    echo $url;
    // Displays Apress (<a href="http://www.apress.com">http://www.apress.com</a>)
?>
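A comparable replacement can be sketched with preg_replace(), assuming ereg_replace() is not available in your PHP version. In a preg_replace() replacement string the back references may be written as $0, $1, and so on; the simplified pattern and the variable names below are choices made for this illustration.

```php
<?php
$text = 'Apress (http://www.apress.com)';

// $0 stands for the entire matched text, much as \0 does with ereg_replace().
$linked = preg_replace(
    '#http://[a-zA-Z0-9./-]+#',
    '<a href="$0">$0</a>',
    $text
);
echo $linked, "\n";
// Apress (<a href="http://www.apress.com">http://www.apress.com</a>)

// $1 stands for the first parenthesized group, here the host without the scheme.
$shortLabel = preg_replace(
    '#http://([a-zA-Z0-9./-]+)#',
    '<a href="$0">$1</a>',
    $text
);
echo $shortLabel, "\n";
// Apress (<a href="http://www.apress.com">www.apress.com</a>)
```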


Note  Although ereg_replace() works just fine, another predefined function named str_replace() is actually much faster when complex regular expressions are not required. str_replace() is discussed in the later section “Replacing All Instances of a String with Another String.”


Replacing Text in a Case-Insensitive Fashion

The eregi_replace() function operates exactly like ereg_replace(), except that the search for pattern in string is not case sensitive. Its prototype follows:

string eregi_replace(string pattern, string replacement, string string)

Splitting a String into Various Elements Based on a Case-Sensitive Pattern

The split() function divides a string into various elements, with the boundaries of each element based on the occurrence of a defined pattern within the string. Its prototype follows:

array split(string pattern, string string [, int limit])

The optional input parameter limit is used to specify the number of elements into which the string should be divided, starting from the left end of the string and working rightward. In cases where the pattern is an alphabetical character, split() is case sensitive. Here’s how you would use split() to break a string into pieces based on occurrences of horizontal tabs and newline characters:

<?php
    $text = "this is\tsome text that\nwe might like to parse.";
    print_r(split("[\n\t]", $text));
?>

This returns the following:

——————————————–
Array ( [0] => this is [1] => some text that [2] => we might like to parse. )

——————————————–
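On a PHP release without split(), the same result can be sketched with preg_split(), which also accepts an optional limit. The variable names are invented for this example.

```php
<?php
$text = "this is\tsome text that\nwe might like to parse.";

// Double quotes turn \n and \t into real newline and tab characters
// before the pattern reaches the regular expression engine.
$pieces = preg_split("/[\n\t]/", $text);
print_r($pieces);   // three elements, split at the tab and at the newline

// With a limit of 2, the remainder of the string stays in the last element.
$firstTwo = preg_split("/[\n\t]/", $text, 2);
print_r($firstTwo);
```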

Splitting a String into Various Elements Based on a Case-Insensitive Pattern

The spliti() function operates in exactly the same manner as its sibling, split(), except that its pattern is treated in a case-insensitive fashion. Its prototype follows:

array spliti(string pattern, string string [, int limit])

Accommodating Products Supporting Solely Case-Sensitive Regular Expressions

The sql_regcase() function converts each character in a string into a bracketed expression containing two characters. If the character is alphabetical, the bracket will contain both forms; otherwise, the original character will be left unchanged. Its prototype follows:

string sql_regcase(string string)

You might use this function as a workaround when using PHP applications to talk to other applications that support only case-sensitive regular expressions. Here’s how you would use sql_regcase() to convert a string:

<?php
    $version = "php 4.0";
    echo sql_regcase($version);
    // outputs [Pp][Hh][Pp] 4.0
?>
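If sql_regcase() is not available in your PHP version, the conversion it performs can be approximated in a few lines. This is a sketch, not the library's implementation: the function name regcase() is hypothetical, and it handles only the ASCII letters A through Z.

```php
<?php
// Hypothetical stand-in for sql_regcase(): wraps each ASCII letter in a
// bracket expression holding its uppercase and lowercase forms, and leaves
// every other character unchanged.
function regcase(string $input): string
{
    $result = preg_replace_callback(
        '/[A-Za-z]/',
        function (array $match): string {
            return '[' . strtoupper($match[0]) . strtolower($match[0]) . ']';
        },
        $input
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $input;
}

echo regcase('php 4.0');   // [Pp][Hh][Pp] 4.0
```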

Regular Expression Syntax (Perl)

Perl has long been considered one of the most powerful parsing languages ever written, and it provides a comprehensive regular expression language that can be used to search and replace even the most complicated of string patterns. The developers of PHP felt that instead of reinventing the regular expression wheel, so to speak, they should make the famed Perl regular expression syntax available to PHP users.

Perl’s regular expression syntax is actually a derivation of the POSIX implementation, resulting in considerable similarities between the two. You can use any of the quantifiers introduced in the previous POSIX section. The remainder of this section is devoted to a brief introduction of Perl regular expression syntax. Let’s start with a simple example of a Perl-based regular expression:

/food/

Notice that the string food is enclosed between two forward slashes. Just as with POSIX regular expressions, you can build a more complex string through the use of quantifiers:

/fo+/

This will match f followed by one or more occurrences of o. Some potential matches include food, fool, and fo4. Here is another example of using a quantifier:

/fo{2,4}/

This matches f followed by two to four occurrences of o. Some potential matches include fool, fooool, and foosball.
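To make it clear which part of each string these two patterns actually consume, the following sketch prints the matched text. It uses preg_match(), one of PHP's Perl-compatible functions, and firstMatch() is a hypothetical helper written for this example.

```php
<?php
// Hypothetical helper: returns the text that matched, or null if nothing did.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

var_dump(firstMatch('/fo+/', 'food'));            // string(3) "foo"
var_dump(firstMatch('/fo+/', 'fo4'));             // string(2) "fo"
var_dump(firstMatch('/fo{2,4}/', 'fooool'));      // string(5) "foooo"
var_dump(firstMatch('/fo{2,4}/', 'foosball'));    // string(3) "foo"
var_dump(firstMatch('/fo{2,4}/', 'fog'));         // NULL: only one o
```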

Modifiers

Often you’ll want to tweak the interpretation of a regular expression; for example, you may want to tell the regular expression to execute a case-insensitive search or to ignore comments embedded within its syntax. These tweaks are known as modifiers, and they go a long way toward helping you to write short and concise expressions. A few of the more interesting modifiers are outlined in Table 9-1.

Table 9-1. Six Sample Modifiers

Modifier   Description
i          Perform a case-insensitive search.
g          Find all occurrences (perform a global search).
m          Treat a string as several (m for multiple) lines. By default, the ^ and $ characters match at the very start and very end of the string in question. Using the m modifier allows ^ and $ to match at the beginning and end of any line in a string.
s          Treat a string as a single line, so that the . metacharacter also matches newline characters.
x          Ignore white space and comments within the regular expression.
U          Stop at the first match. Many quantifiers are "greedy"; they match the pattern as many times as possible rather than just stop at the first match. You can cause them to be "ungreedy" with this modifier.

These modifiers are placed directly after the regular expression, for instance /string/i. Let’s consider a few examples:

/wmd/i : Matches WMD, wMD, WMd, wmd, and any other case variation of the string wmd.

/taxation/gi : Locates all occurrences of the word taxation. You might use the global modifier to tally up the total number of occurrences, or use it in conjunction with a replacement feature to replace all occurrences with some other string.
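A few of these modifiers can be tried with PHP's preg functions, as in the sketch below. One caveat: the preg functions do not accept a g modifier, so the global search is done here with preg_match_all() instead. The sample text and variable names are invented for this example.

```php
<?php
$text = "Taxation is complex.\ntaxation funds roads.\nNo TAXATION here.";

// i: case-insensitive; preg_match_all() counts every occurrence.
$total = preg_match_all('/taxation/i', $text);

// m: ^ matches at the start of each line, not only at the start of the string.
$atLineStart = preg_match_all('/^taxation/im', $text);

// Without m, only the very beginning of the string can satisfy ^.
$atStringStart = preg_match_all('/^taxation/i', $text);

// U: quantifiers become ungreedy, so .* stops at the first closing tag.
preg_match('/<b>.*<\/b>/',  '<b>one</b> and <b>two</b>', $greedy);
preg_match('/<b>.*<\/b>/U', '<b>one</b> and <b>two</b>', $lazy);

echo $total, ' ', $atLineStart, ' ', $atStringStart, "\n";  // 3 2 1
echo $greedy[0], "\n";  // <b>one</b> and <b>two</b>
echo $lazy[0], "\n";    // <b>one</b>
```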

Metacharacters

Perl regular expressions also employ metacharacters to further filter their searches. A metacharacter is simply an alphabetical character preceded by a backslash that symbolizes special meaning. A list of useful metacharacters follows:

\A : Matches only at the beginning of the string.

\b : Matches a word boundary.

\B : Matches anything but a word boundary.

\d : Matches a digit character. This is the same as [0-9].

\D : Matches a nondigit character.

\s : Matches a whitespace character.

\S : Matches a nonwhitespace character.

[] : Encloses a character class.

() : Encloses a character grouping or defines a back reference.

$ : Matches the end of a line.

^ : Matches the beginning of a line.

. : Matches any character except for the newline.

\ : Quotes the next metacharacter.

\w : Matches any alphanumeric character or underscore. This is the same as [a-zA-Z0-9_].

\W : Matches any character that is not alphanumeric or an underscore.

Let’s consider a few examples. The first regular expression will match strings such as pisa and lisa but not sand:

/sa\b/

The next returns the first case-insensitive occurrence of the word linux:

/\blinux\b/i

The opposite of the word boundary metacharacter is \B, matching on anything but a word boundary. Therefore this example will match strings such as sand and sally but not Melissa:

/sa\B/

The final example returns all instances of strings matching a dollar sign followed by one or more digits:

/\$\d+/g
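The boundary and digit metacharacters can be checked with a short script along these lines. It is a sketch only: matches() is a hypothetical helper, the sample strings are invented, and preg_match_all() takes the place of the g modifier, which PHP's preg functions do not accept.

```php
<?php
// Hypothetical helper: TRUE if $pattern matches anywhere in $subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// \b: "sa" must be followed by a word boundary.
var_dump(matches('/sa\b/', 'pisa'));     // bool(true)
var_dump(matches('/sa\b/', 'sand'));     // bool(false)

// \B: "sa" must not be followed by a word boundary.
var_dump(matches('/sa\B/', 'sand'));     // bool(true)
var_dump(matches('/sa\B/', 'melissa'));  // bool(false)

// \b on both sides plus the i modifier: the whole word, in any case.
var_dump(matches('/\blinux\b/i', 'I run Linux at home'));  // bool(true)
var_dump(matches('/\blinux\b/i', 'linuxes'));              // bool(false)

// \$ is a literal dollar sign and \d+ is one or more digits.
preg_match_all('/\$\d+/', 'Paid $42, then $560.', $amounts);
print_r($amounts[0]);   // the two dollar amounts, in order
```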

Please check back next week for the continuation of this article.
