Strings and regular expressions are among the basic tools that help programmers get their jobs done. This five-part article series covers how these are used in PHP. It is excerpted from chapter nine of the book Beginning PHP and Oracle: From Novice to Professional, written by W. Jason Gilmore and Bob Bryla (Apress; ISBN: 1590597702).
The structure of a POSIX regular expression is similar to that of a typical arithmetic expression: various elements (operators) are combined to form a more complex expression. The meaning of the combined regular expression elements is what makes them so powerful. You can locate not only literal expressions, such as a specific word or number, but also a multitude of semantically different but syntactically similar strings, such as all HTML tags in a file.
Note: POSIX stands for Portable Operating System Interface for Unix, and refers to a set of standards originally intended for Unix-based operating systems. POSIX regular expression syntax is an attempt to standardize how regular expressions are implemented in many programming languages.
The simplest regular expression is one that matches a single character, such as g, which would match strings such as gog, haggle, and bag. You could combine several letters together to form larger expressions, such as gan, which logically would match any string containing gan: gang, organize, or Reagan, for example.
You can also test for several different expressions simultaneously by using the pipe (|) character. For example, you could test for php or zend via the regular expression php|zend.
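The literal and alternation examples above can be tried with a short script. This is only a sketch: it assumes a PHP version in which the POSIX (ereg) functions are no longer available (they were removed in PHP 7.0), so it uses the PCRE function preg_match() as a stand-in. The helper name containsPattern is hypothetical.

```php
<?php
// Hypothetical helper for illustration: true if $regex matches anywhere in $subject.
// preg_match() is PHP's PCRE function, used here in place of the POSIX functions.
function containsPattern(string $regex, string $subject): bool
{
    return preg_match($regex, $subject) === 1;
}

// The surrounding slashes are PCRE delimiters, not part of the expression.
foreach (['gang', 'organize', 'Reagan', 'dragon'] as $word) {
    echo $word, ': ', containsPattern('/gan/', $word) ? 'match' : 'no match', "\n";
}

// Alternation with the pipe: either "php" or "zend" satisfies the expression.
var_dump(containsPattern('/php|zend/', 'zend framework')); // bool(true)
var_dump(containsPattern('/php|zend/', 'perl'));           // bool(false)
```

Only dragon fails the first test, because it contains gon rather than gan.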
Before getting into PHP’s POSIX-based regular expression functions, let’s review three methods that POSIX supports for locating different character sequences: brackets, quantifiers, and predefined character ranges.
Brackets
Brackets ([]) are used to represent a list, or range, of characters to be matched. For instance, contrary to the regular expression php, which will locate strings containing the explicit string php, the regular expression [php] will find any string containing the character p or h. Several commonly used character ranges follow:
[0-9] matches any decimal digit from 0 through 9.
[a-z] matches any character from lowercase a through lowercase z.
[A-Z] matches any character from uppercase A through uppercase Z.
[A-Za-z] matches any character from uppercase A through uppercase Z or from lowercase a through lowercase z.
Of course, the ranges shown here are general; you could also use the range [0-3] to match any decimal digit ranging from 0 through 3, or the range [b-v] to match any lowercase character ranging from b through v. In short, you can specify any ASCII range you wish.
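The difference between a literal sequence and a bracketed list is easy to check in code. The following is a minimal sketch that uses preg_match() (PCRE) as a stand-in for the POSIX functions, which is an assumption about the PHP version in use; the helper name hasMatch is hypothetical.

```php
<?php
// Hypothetical helper for illustration, built on preg_match() (PCRE).
function hasMatch(string $regex, string $subject): bool
{
    return preg_match($regex, $subject) === 1;
}

// php is a literal sequence; [php] is a list in which one p or one h is enough.
var_dump(hasMatch('/php/', 'hat'));      // bool(false): "php" does not appear
var_dump(hasMatch('/[php]/', 'hat'));    // bool(true): "hat" contains an h

// Custom ranges work the same way as the common ones.
var_dump(hasMatch('/[0-3]/', 'room 2')); // bool(true): 2 lies within 0-3
var_dump(hasMatch('/[0-3]/', 'room 7')); // bool(false): 7 lies outside 0-3
var_dump(hasMatch('/[b-v]/', 'way'));    // bool(false): w, a, and y all lie outside b-v
```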
Quantifiers
Sometimes you might want to create regular expressions that look for characters based on their frequency or position. For example, you might want to look for strings containing one or more instances of the letter p, strings containing at least two p’s, or even strings with the letter p as their beginning or terminating character. You can make these demands by inserting special characters into the regular expression. Here are several examples of these characters:
p+ matches any string containing at least one p.
p* matches any string containing zero or more p’s (so, used on its own, it matches every string).
p? matches any string containing zero or one p (so, used on its own, it also matches every string).
p{2} matches any string containing a sequence of two p’s.
p{2,3} matches any string containing a sequence of two or three p’s.
p{2,} matches any string containing a sequence of at least two p’s.
p$ matches any string with p at the end of it.
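To see what each quantifier actually consumes, it helps to print the matched text rather than a yes/no result. The sketch below assumes preg_match() (PCRE) as a substitute for the POSIX functions; firstMatch is a hypothetical helper that returns the first matched substring, or null when nothing matches.

```php
<?php
// Hypothetical helper for illustration: returns the text of the first match, or null.
function firstMatch(string $regex, string $subject): ?string
{
    return preg_match($regex, $subject, $m) === 1 ? $m[0] : null;
}

var_dump(firstMatch('/p+/', 'apple'));       // "pp": one or more p's, as many as possible
var_dump(firstMatch('/p{2}/', 'pop'));       // NULL: the two p's are not adjacent
var_dump(firstMatch('/p{2,3}/', 'hippppo')); // "ppp": at most three are taken
var_dump(firstMatch('/p{2,}/', 'hippppo'));  // "pppp": no upper limit
var_dump(firstMatch('/p*/', 'hello'));       // "": zero p's still counts as a match
var_dump(firstMatch('/p$/', 'stop'));        // "p": the p is the final character
var_dump(firstMatch('/p$/', 'stops'));       // NULL: the string ends in s
```

The patterns are written in single quotes so that PHP does not try to interpret the dollar sign as the start of a variable name.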
Still other flags can be inserted before and within a character sequence:
^p matches any string with p at the beginning of it.
[^a-zA-Z] matches any single character that is not in the ranges a through z and A through Z, so a string matches if it contains at least one non-alphabetical character.
p.p matches any string containing p, followed by any character, in turn followed by another p.
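These three constructs can be exercised as follows. This is a hedged sketch: preg_match() (PCRE) stands in for the POSIX functions, and matchesRegex is a hypothetical helper name. Note that a caret inside brackets negates the list for one character position; it does not assert anything about the string as a whole.

```php
<?php
// Hypothetical helper for illustration, built on preg_match() (PCRE).
function matchesRegex(string $regex, string $subject): bool
{
    return preg_match($regex, $subject) === 1;
}

// ^ outside brackets anchors the match to the start of the string.
var_dump(matchesRegex('/^p/', 'php'));         // bool(true)
var_dump(matchesRegex('/^p/', 'cpp'));         // bool(false)

// ^ inside brackets negates the list: one non-letter is enough to match.
var_dump(matchesRegex('/[^a-zA-Z]/', 'abc'));  // bool(false): only letters
var_dump(matchesRegex('/[^a-zA-Z]/', 'abc1')); // bool(true): the 1 is not a letter

// The dot stands for exactly one character between the two p's.
var_dump(matchesRegex('/p.p/', 'pop'));        // bool(true)
var_dump(matchesRegex('/p.p/', 'pp'));         // bool(false): nothing between the p's
```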
You can also combine special characters to form more complex expressions. Consider the following examples:
^.{2}$ matches any string containing exactly two characters.
<b>(.*)</b> matches any string enclosed within <b> and </b>.
p(hp)* matches any string containing a p followed by zero or more instances of the sequence hp.
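The combined expressions above can be tested in the same way, this time returning the captured groups as well. The sketch assumes preg_match() (PCRE) in place of the POSIX functions; captureGroups is a hypothetical helper. Because the second pattern contains a slash, it uses # as the PCRE delimiter instead.

```php
<?php
// Hypothetical helper for illustration: returns the match array, or null if no match.
// Index 0 holds the whole match; index 1 onward holds the parenthesized groups.
function captureGroups(string $regex, string $subject): ?array
{
    return preg_match($regex, $subject, $m) === 1 ? $m : null;
}

// Exactly two characters between the start (^) and the end ($).
var_dump(captureGroups('/^.{2}$/', 'ab') !== null);  // bool(true)
var_dump(captureGroups('/^.{2}$/', 'abc') !== null); // bool(false)

// The parentheses capture whatever sits between the bold tags.
$bold = captureGroups('#<b>(.*)</b>#', 'Click <b>here</b> now');
echo $bold[1], "\n"; // here

// A p followed by zero or more repetitions of hp.
$php = captureGroups('/p(hp)*/', 'phphp');
echo $php[0], "\n";  // phphp
```

One caveat with (.*): it is greedy, so in a string holding several bold elements it captures everything from the first <b> to the last </b>.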
You may wish to search for these special characters in strings instead of using them in the special context just described. To do so, the characters must be escaped with a backslash (\). For example, if you want to search for a dollar amount, a plausible regular expression would be as follows: ([\$])([0-9]+); that is, a dollar sign followed by one or more digits. Notice the backslash preceding the dollar sign. Potential matches of this regular expression include $42, $560, and $3.
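As a rough illustration, the dollar-amount expression can be applied like this. The sketch assumes preg_match() (PCRE) rather than the POSIX functions, and extractDollarAmount is a hypothetical helper name. The pattern is placed in single quotes so that PHP passes the backslash and dollar sign through to the regular expression engine unchanged.

```php
<?php
// Hypothetical helper for illustration: returns the digits that follow the
// first dollar sign in $text, or null if no dollar amount is present.
function extractDollarAmount(string $text): ?string
{
    // Group 1 captures the escaped, literal dollar sign; group 2 captures the digits.
    if (preg_match('/([\$])([0-9]+)/', $text, $m) === 1) {
        return $m[2];
    }
    return null;
}

var_dump(extractDollarAmount('Total: $42'));  // string(2) "42"
var_dump(extractDollarAmount('$560 due'));    // string(3) "560"
var_dump(extractDollarAmount('42 dollars'));  // NULL: no dollar sign before the digits
```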
Predefined Character Ranges (Character Classes)
For reasons of convenience, several predefined character ranges, also known as character classes, are available. Character classes specify an entire range of characters—for example, the alphabet or an integer set. Standard classes include the following:
[:alpha:]: Lowercase and uppercase alphabetical characters. This can also be specified as [A-Za-z].
[:alnum:]: Lowercase and uppercase alphabetical characters and numerical digits. This can also be specified as [A-Za-z0-9].
[:cntrl:]: Control characters such as tab, escape, or backspace.
[:digit:]: Numerical digits 0 through 9. This can also be specified as [0-9].
[:graph:]: Printable characters found in the range of ASCII 33 to 126.
[:lower:]: Lowercase alphabetical characters. This can also be specified as [a-z].