Converting Strings and Regular Expressions

In this third part of a five-part series on strings and regular expressions in PHP, you will learn how to convert strings to and from HTML, and more. This article is excerpted from chapter nine of the book Beginning PHP and Oracle: From Novice to Professional, written by W. Jason Gilmore and Bob Bryla (Apress; ISBN: 1590597702).

Converting Strings to and from HTML

Converting a string or an entire file into a form suitable for viewing on the Web (and vice versa) is easier than you would think. Several functions are suited for such tasks, all of which are introduced in this section.

Converting Newline Characters to HTML Break Tags

The nl2br() function inserts an XHTML-compliant line break tag, <br />, before every newline (\n) character in a string. Its prototype follows:

string nl2br(string str)

The newline characters can come from pressing the Enter key while typing the string, or from an explicit \n written into the string. The following example translates a text string to HTML format:

    $recipe = "3 tablespoons Dijon mustard
    1/3 cup Caesar salad dressing
    8 ounces grilled chicken breast
    3 cups romaine lettuce";

    // convert the newlines to <br />’s.
    echo nl2br($recipe);

Executing this example results in the following output:

3 tablespoons Dijon mustard<br />
1/3 cup Caesar salad dressing<br />
8 ounces grilled chicken breast<br />
3 cups romaine lettuce

Converting Special Characters to their HTML Equivalents

During the general course of communication, you may come across many characters that are not included in a document’s text encoding, or that are not readily available on the keyboard. Examples of such characters include the copyright symbol (©), the cent sign (¢), and the letter e with a grave accent (è). To address these shortcomings, a set of universal key codes was devised, known as character entity references. When these entities are parsed by the browser, they are converted into their recognizable counterparts. For example, the three aforementioned characters would be represented as &copy;, &cent;, and &egrave;, respectively.

To perform these conversions, you can use the htmlentities() function. Its prototype follows:

string htmlentities(string str [, int quote_style [, string charset]])

Because of the special nature of quote marks within markup, the optional quote_style parameter offers the opportunity to choose how they will be handled. Three values are accepted:

ENT_COMPAT: Convert double quotes and ignore single quotes. This is the default.

ENT_NOQUOTES: Ignore both double and single quotes.

ENT_QUOTES: Convert both double and single quotes.

A second optional parameter, charset, determines the character set used for the conversion. Table 9-2 offers the list of supported character sets. If charset is omitted, it will default to ISO-8859-1.



Table 9-2. htmlentities()’s Supported Character Sets

Character Set  Description
BIG5           Traditional Chinese
BIG5-HKSCS     BIG5 with additional Hong Kong extensions, traditional Chinese
cp866          DOS-specific Cyrillic character set
cp1251         Windows-specific Cyrillic character set
cp1252         Windows-specific character set for Western Europe
EUC-JP         Japanese
GB2312         Simplified Chinese
ISO-8859-1     Western European, Latin-1
ISO-8859-15    Western European, Latin-9
KOI8-R         Russian
Shift-JIS      Japanese
UTF-8          ASCII-compatible multibyte 8-bit encoding



The following example converts the necessary characters for Web display:

    $advertisement = "Coffee at 'Cafè Française' costs $2.25.";
    echo htmlentities($advertisement);

This returns the following:

Coffee at 'Caf&egrave; Fran&ccedil;aise' costs $2.25.


Two characters are converted, the grave accent (è) and the cedilla (ç). The single quotes are ignored due to the default quote_style setting ENT_COMPAT.
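
To make the effect of quote_style concrete, here is a minimal sketch that runs one string through all three settings. The sample text and variable names are hypothetical. The flags and the charset are passed explicitly because the defaults depend on the PHP version in use.

```php
<?php
// Hypothetical sample text; "\xC3\xA9" is the UTF-8 byte sequence for é.
$text = "\"Caf\xC3\xA9\" isn't <closed>";

// Pass quote_style and charset explicitly so the result does not
// depend on version-specific defaults.
$compat   = htmlentities($text, ENT_COMPAT, 'UTF-8');   // double quotes only
$quotes   = htmlentities($text, ENT_QUOTES, 'UTF-8');   // double and single quotes
$noquotes = htmlentities($text, ENT_NOQUOTES, 'UTF-8'); // no quotes at all

echo $compat, "\n";   // &quot;Caf&eacute;&quot; isn't &lt;closed&gt;
echo $quotes, "\n";   // &quot;Caf&eacute;&quot; isn&#039;t &lt;closed&gt;
echo $noquotes, "\n"; // "Caf&eacute;" isn't &lt;closed&gt;
```

In all three cases the accented letter and the angle brackets are converted; only the treatment of the quote marks differs.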

Using Special HTML Characters for Other Purposes

Several characters play a dual role in both markup languages and the human language. When used in the latter fashion, these characters must be converted into their displayable equivalents. For example, an ampersand must be converted to &amp;, whereas a greater-than character must be converted to &gt;. The htmlspecialchars() function can do this for you. Its prototype follows:

string htmlspecialchars(string str [, int quote_style [, string charset]])

The list of characters that htmlspecialchars() can convert and their resulting formats follow:

  1. & becomes &amp;
  2. " (double quote) becomes &quot; 
  3. ' (single quote) becomes &#039; (only when quote_style is ENT_QUOTES)
  4. < becomes &lt; 
  5. > becomes &gt;

This function is particularly useful in preventing users from entering HTML markup into an interactive Web application, such as a message board.

The following example converts potentially harmful characters using htmlspecialchars():

$input = "I just can’t get <<enough>> of PHP!";
echo htmlspecialchars($input);

Viewing the source, you’ll see the following:

I just can’t get &lt;&lt;enough&gt;&gt; of PHP!

If the translation isn’t necessary, perhaps a more efficient way to do this would be to use strip_tags(), which deletes the tags from the string altogether.

Tip  If you are using htmlspecialchars() in conjunction with a function such as nl2br(), you should execute nl2br() after htmlspecialchars(); otherwise, the <br /> tags that are generated with nl2br() will be converted to visible characters.
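
The ordering described in the tip can be sketched as follows. The input string and variable names are hypothetical, and the quote_style and charset arguments are passed explicitly so the result does not depend on version defaults.

```php
<?php
// Hypothetical user input containing markup characters and a newline.
$comment = "I <3 PHP\nSee <b>this</b>";

// Escape first, then add the line breaks: the <br /> tags stay real markup.
$right = nl2br(htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'));

// Add line breaks first, then escape: the <br /> tags get escaped too
// and show up in the browser as visible text.
$wrong = htmlspecialchars(nl2br($comment), ENT_QUOTES, 'UTF-8');

echo $right, "\n";
echo $wrong, "\n";
```

In the first result the break tag survives as `<br />`; in the second it becomes `&lt;br /&gt;`.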


Converting Text into Its HTML Equivalent

Using get_html_translation_table() is a convenient way to translate text to its HTML equivalent, returning one of the two translation tables (HTML_SPECIALCHARS or HTML_ENTITIES). Its prototype follows:

array get_html_translation_table(int table [, int quote_style])

This returned value can then be used in conjunction with another predefined function, strtr() (formally introduced later in this section), to essentially translate the text into its corresponding HTML code.

The following sample uses get_html_translation_table() to convert text to HTML:

$string = "La pasta é il piatto piú amato in Italia";
$translate = get_html_translation_table(HTML_ENTITIES);
echo strtr($string, $translate);

This returns the string formatted as necessary for browser rendering:

La pasta &eacute; il piatto pi&uacute; amato in Italia

Interestingly, passing the translation table through array_flip() swaps its keys and values, which lets strtr() reverse the text-to-HTML translation. Assume that you have the entity-encoded string produced by the preceding code sample and want to recover the original text.

The next example uses array_flip() to return a string back to its original value:

    $entities = get_html_translation_table(HTML_ENTITIES);
    $translate = array_flip($entities);
    $string = "La pasta &eacute; il piatto pi&uacute; amato in Italia";
    echo strtr($string, $translate);

This returns the following:

La pasta é il piatto piú amato in Italia
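
The two directions can also be combined into a round trip. This is a minimal sketch, assuming the script file is saved as UTF-8 and that the optional third (encoding) argument of get_html_translation_table() is available in your PHP version; the variable names are illustrative.

```php
<?php
// Assumes this file is saved as UTF-8, matching the encoding passed below.
$original = "La pasta é il piatto piú amato in Italia";

// Table mapping characters to entities, and its flipped counterpart.
$toEntities   = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
$fromEntities = array_flip($toEntities);

$encoded = strtr($original, $toEntities);   // text -> entities
$decoded = strtr($encoded, $fromEntities);  // entities -> text

echo $encoded, "\n";
var_dump($decoded === $original);
```

If the round trip works as intended, the final comparison is true and $decoded is identical to $original.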

Creating a Customized Conversion List

The strtr() function converts all characters in a string to their corresponding match found in a predefined array. Its prototype follows:

string strtr(string str, array replacements)

This example converts the deprecated bold (<b>) tag to its XHTML equivalent:

    $table = array("<b>" => "<strong>", "</b>" => "</strong>");
    $html = "<b>Today In PHP-Powered News</b>";
    echo strtr($html, $table);

This returns the following:

<strong>Today In PHP-Powered News</strong>
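
One detail worth knowing when building a conversion list is how strtr() chooses among overlapping keys. The following sketch, with a hypothetical table, shows that the longest matching key wins and that text which has already been replaced is not scanned again.

```php
<?php
// Hypothetical table with overlapping keys.
$table = array("hi" => "hello", "hello" => "hi", "h" => "-");

// "hi" matches both "hi" and "h"; the longer key is used.
// The "hello" produced by that replacement is not converted back to "hi".
$result = strtr("hi all, I said hello", $table);

echo $result; // hello all, I said hi
```

Because each part of the input is translated at most once, the two words are swapped cleanly instead of being rewritten twice.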


Converting HTML to Plain Text

You may sometimes need to convert an HTML file to plain text. You can do so using the strip_tags() function, which removes all HTML and PHP tags from a string, leaving only the text entities. Its prototype follows:

string strip_tags(string str [, string allowable_tags])

The optional allowable_tags parameter allows you to specify which tags you would like to be skipped during this process. This example uses strip_tags() to delete all HTML tags from a string:

    $input = "Email <a href=''></a>";
    echo strip_tags($input);

This returns the following:

Email

The following sample strips all tags except the <a> tag:

    $input = "This <a href=''>example</a>
              is <b>awesome</b>!";
    echo strip_tags($input, "<a>");

This returns the following:

This <a href=''>example</a> is awesome!
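
A point the examples above only hint at is that tags named in allowable_tags are kept as they are, attributes included. Here is a short sketch with a hypothetical input string:

```php
<?php
// Hypothetical markup; the onclick attribute stands in for unwanted scripting.
$input = '<p onclick="track()">Read <b>this</b> <a href="#top">link</a></p>';

$plain = strip_tags($input);        // every tag removed
$links = strip_tags($input, '<a>'); // only <a> tags kept
$para  = strip_tags($input, '<p>'); // <p> kept, together with its onclick

echo $plain, "\n"; // Read this link
echo $links, "\n"; // Read this <a href="#top">link</a>
echo $para, "\n";  // <p onclick="track()">Read this link</p>
```

Since the attributes of allowed tags pass through untouched, allowing a tag is not the same as making it safe to display.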


Note  Another function that behaves like strip_tags() is fgetss(). This function is described in Chapter 10.

Alternatives for Regular Expression Functions

When you’re processing large amounts of information, the regular expression functions can slow matters dramatically. You should use these functions only when you are interested in parsing relatively complicated strings that require the use of regular expressions. If you are instead interested in parsing for simple expressions, there are a variety of predefined functions that speed up the process considerably. Each of these functions is described in this section.

Tokenizing a String Based on Predefined Characters

The strtok() function parses the string based on a predefined list of characters. Its prototype follows:

string strtok(string str, string tokens)

One oddity about strtok() is that it must be continually called in order to completely tokenize a string; each call only tokenizes the next piece of the string. However, the str parameter needs to be specified only once because the function keeps track of its position in str until it either completely tokenizes str or a new str parameter is specified. Its behavior is best explained via an example:

    $info = "J.|Columbus, Ohio";

    // delimiters include colon (:), vertical bar (|), and comma (,)
    $tokens = ":|,";
    $tokenized = strtok($info, $tokens);

    // print each token in turn; strtok() returns false when none remain
    while ($tokenized !== false) {
        echo "Element = $tokenized<br>";
        // Don’t include the first argument in subsequent calls.
        $tokenized = strtok($tokens);
    }

This returns the following:

Element = J. Gilmore
Element =
Element = Columbus
Element = Ohio
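
The calling pattern can also be seen in a self-contained sketch. The input string here is hypothetical and was chosen to show two details: consecutive delimiters produce no empty tokens, and the loop compares against false so that a token such as "0" does not end it early.

```php
<?php
// Hypothetical input: note the "0" token and the doubled "||".
$info   = "1,0:2||3";
$delims = ":|,";

$tokens = array();
$tok = strtok($info, $delims);   // first call: pass the string
while ($tok !== false) {         // "0" is a valid token, so test for false
    $tokens[] = $tok;
    $tok = strtok($delims);      // later calls: delimiters only
}

print_r($tokens); // elements: 1, 0, 2, 3
```

A plain while ($tok) test would stop at the "0" token, because PHP treats the string "0" as false.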

Exploding a String Based on a Predefined Delimiter

The explode() function divides the string str into an array of substrings. Its prototype follows:

array explode(string separator, string str [, int limit])

The original string is divided into distinct elements by splitting it on the string specified by separator. The number of elements can be limited with the optional inclusion of limit. Let’s use explode() in conjunction with sizeof() and strip_tags() to determine the total number of words in a given block of text:

$summary = <<< summary
In the latest installment of the ongoing PHP series,
I discuss the many improvements and additions to
<a href="">PHP 5's</a> object-oriented architecture.
summary;
$words = sizeof(explode(' ',strip_tags($summary)));
echo "Total words in summary: $words";

This returns the following:

Total words in summary: 19

Note that explode() splits only on the space character, so two words separated only by a line break are counted as a single element.

The explode() function is generally considerably faster than preg_split(), split(), and spliti(). Therefore, use it instead of the others when a regular expression isn’t necessary.

Note  You might be wondering why the previous code is indented in an inconsistent manner. The multiple-line string was delimited using heredoc syntax, which requires the closing identifier to not be indented even a single space. Why this restriction is in place is somewhat of a mystery, although one would presume it makes the PHP engine’s job a tad easier when parsing the multiple-line string. See Chapter 3 for more information about heredoc.
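
The optional limit parameter of explode() mentioned earlier can be sketched as follows, using a hypothetical comma-separated string:

```php
<?php
$csv = "a,b,c,d"; // hypothetical input

$all      = explode(",", $csv);     // no limit: every element
$firstTwo = explode(",", $csv, 2);  // at most 2 elements; the last holds the rest
$dropLast = explode(",", $csv, -1); // negative limit: omit the last element

print_r($all);      // a, b, c, d
print_r($firstTwo); // a, "b,c,d"
print_r($dropLast); // a, b, c
```

A positive limit is handy when only the first field matters and the remainder should stay intact.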

Please check back next week for the next part of this article.
