String Processing

Beginning Perl, Second Edition

Written by James Lee

Published by Apress

Chapter 9



Perl was created to be a text processing language, and it is arguably the most powerful text processing language around. One way that Perl displays its power in processing text is through its built-in regular expression support that we discussed in Chapter 7. Perl also has many built-in string operators (such as the string concatenation operator . and the string replication operator x ) and string functions. In this chapter we will explore several string functions and one very helpful string operator.

Character Position

Before we get started with some of Perl’s built-in functions, we should talk about the ability to access characters in a string by indexing into the string. The numeric position of a character in a string is known as its index. Recall that Perl is 0-based (it starts counting things from 0), and this applies to character indexing as well. So, for this string:

"Wish You Were Here"

here are the characters of the string and their indexes:

character 0:  W
character 1:  i
character 2:  s
character 3:  h
character 4:  <space>
character 5:  Y

character 17: e

We can also index characters by starting at the right-most character and counting down starting from –1 (much like accessing an array). Therefore, the characters in the preceding string are also known by these indices:

character -1:  e
character -2:  r
character -3:  e
character -4:  H
character -5:  <space>
character -6:  e

character -18: W
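
As a quick check of the two numbering schemes, here is a minimal sketch that reads single characters with the substr() function, which is covered later in this chapter. The file name and variable names are only illustrative:

```perl
#!/usr/bin/perl -w
# charpos.pl (hypothetical file name)

use strict;

my $title = "Wish You Were Here";

# substr(string, index, 1) returns the one character at that index
my $first      = substr($title, 0, 1);     # counting from the left
my $last       = substr($title, -1, 1);    # counting from the right
my $same_first = substr($title, -18, 1);   # -18 names the same character as 0

print "index 0:   $first\n";
print "index -1:  $last\n";
print "index -18: $same_first\n";
```

Because the string is 18 characters long, index 0 and index -18 both refer to the “W”.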

String Functions

Perl has many string functions built into the language. We will now discuss several of the most common functions used to process text.

The length() Function

To determine the length of a string, we can use the length() function.

length(string)

This function returns the number of characters of its argument. If no argument is given, length() returns the number of characters of $_ . Here is an example:

#!/usr/bin/perl -w
# length.pl

use strict;

my $song = 'The Great Gig in the Sky';
print 'length of $song: ', length($song), "\n";
# the *real* length is 4:44

$_ = 'Us and Them';
print 'length of $_: ', length, "\n";
# this one is 7:40

Running this code produces the following:

$ perl length.pl
length of $song: 24
length of $_: 11
$
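
One way the default-argument form might be used in practice is inside a loop that sets $_ for us. The following is only a sketch; the list of titles and the variable names are made up for illustration:

```perl
#!/usr/bin/perl -w
# longest.pl (hypothetical file name)

use strict;

# an illustrative list of titles
my @songs = ('Time', 'Money', 'Us and Them', 'Brain Damage');

my $longest = '';
foreach (@songs) {
    my $n = length;    # no argument, so this measures $_
    $longest = $_ if $n > length($longest);
}

print "longest title: $longest\n";
```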

The index() Function

The index() function locates substrings in strings. Its syntax is  

index(string, substring)

It returns the starting index (0-based) of where the substring is located in the string. If the substring is not found, it returns –1. This invocation:

index('Larry Wall', 'Wall')

would return 6 since the substring “Wall” is contained within the string “Larry Wall” starting at position 6 (0-based, remember?). This invocation:

index('Pink Floyd', 'ink');

would return 1.

The index() function has an optional third argument that indicates the starting position from which it should start looking. For instance, this invocation:

index('Roger Waters', 'er', 0)

tells index() to try to locate the substring “er” in “Roger Waters” (www.roger-waters.com) and to start looking from position 0. Position 0 is the default, so it is not necessary to include it, but it is OK if you do. This invocation returns 3. If we provide another starting position, as in

index('Roger Waters', 'er', 5)

it tells index() to search for the substring “er” in “Roger Waters” but start searching from index 5. This returns 9 because it finds the “er” in “Waters”.
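
The third argument makes it possible to find every occurrence of a substring, not just the first, by restarting the search just past the previous match. Here is a minimal sketch of that idea; the variable names are illustrative only:

```perl
#!/usr/bin/perl -w
# index_all.pl (hypothetical file name)

use strict;

my $string    = 'Roger Waters';
my $substring = 'er';

my @positions;
my $pos = index($string, $substring);
while ($pos != -1) {
    push @positions, $pos;
    # resume the search one character past the last match
    $pos = index($string, $substring, $pos + 1);
}

print "found at: @positions\n";    # prints "found at: 3 9"
```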

Here is an example illustrating the use of the index() function. It prompts the user for a string and then a substring and determines if the substring is in the string. If so, index() returns something other than –1, so we print that result to the user. Otherwise we inform the user that the substring was not found.

#!/usr/bin/perl -w
# index.pl

use strict;

print "Enter a string:    ";
chomp(my $string = <STDIN>);

print "Enter a substring: ";
chomp(my $substring = <STDIN>);

my $result = index($string, $substring);

if ($result != -1) {
    print "the substring was found at index: $result\n";
} else {
    print "the substring was not found\n";
}

Here is an example of running this program:

$ perl index.pl
Enter a string:     Perl is cool!
Enter a substring:  cool
the substring was found at index: 8
$ perl index.pl
Enter a string:     hello, world!
Enter a substring:  cool
the substring was not found
$

The rindex() Function

The rindex() function is similar to index() except that it searches the string from right to left (instead of left to right). The syntax is similar:  

rindex(string, substring)

This function searches right to left through the string for the substring. It returns the 0-based index of where the substring is in the string, or –1 if the substring is not found.

An important note: even though this function searches through the string from right to left, the 0th character of the string is still the left-most character.

This invocation:

rindex('David Gilmour', 'i')

searches from the right-hand side of “David Gilmour” looking for the substring “i”. It finds it at position 7 (the “i” in “Gilmour”).

This function also has an optional third argument that is the character position from which it begins looking for the substring. This invocation:

rindex('David Gilmour', 'i', 6)

starts at position 6 (the “G” in “Gilmour”) and looks right to left for an “i” and finds it at position 3.
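
The same two results can be collected in a loop that walks leftward through the string, passing each match position minus one as the third argument. This is a sketch with illustrative names. The explicit check for position 0 is there because a starting position before the beginning of the string is treated as the beginning, so without it a match at index 0 would be found again and again:

```perl
#!/usr/bin/perl -w
# rindex_all.pl (hypothetical file name)

use strict;

my $name = 'David Gilmour';

my @positions;
my $pos = rindex($name, 'i');
while ($pos != -1) {
    push @positions, $pos;
    last if $pos == 0;    # nothing further to the left to search
    # resume the search one character to the left of the last match
    $pos = rindex($name, 'i', $pos - 1);
}

print "found at: @positions\n";    # prints "found at: 7 3"
```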

The substr() Function

When processing text, we often encounter strings that follow a specific column layout: for example, a string that contains a customer’s first name in columns 1–20, the last name in columns 21–40, and the address in columns 41–70. We can use the substr() function to extract these fields from the string. Its syntax is

substr(string, starting_index, length)

It returns length characters of string, starting from starting_index. If the requested length extends beyond the end of the string, it returns all the characters from starting_index to the end. For example, let’s say we have read a fixed-length record in from a file, and we know that the job title for that record runs from column 24 (0-based) to column 53. Here is an example line from the file:

'John A.     Smith       Perl programmer'

If this record was read into the variable $record , this invocation would access John’s job:

$s = substr($record, 24, 30);

Since there is more than one way to do it in Perl (TMTOWTDI), this invocation of substr() can be performed with a regular expression:

($s) = $record =~ /^.{24}(.{1,30})/;

This statement matches the string in $record against a regex that translates to “Match 24 of any character but ‘\n’ at the beginning of the string, followed by between 1 and 30 of any character but ‘\n’”. The parentheses around .{1,30} store those characters in $1. Then an assignment is made to the list ($s) that copies over $1 and stores it into $s. As a result, $s is the string “Perl programmer”.
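
To see that the two approaches agree, here is a small sketch. It builds the record with sprintf() under an assumed layout (12 columns for the first name, 12 for the last name, and the job title from index 24 onward); that layout and the variable names are assumptions made for this illustration:

```perl
#!/usr/bin/perl -w
# substr_vs_regex.pl (hypothetical file name)

use strict;

# assumed layout: two 12-column name fields, then the job title at index 24
my $record = sprintf('%-12s%-12s%s', 'John A.', 'Smith', 'Perl programmer');

my $by_substr  = substr($record, 24, 30);
my ($by_regex) = $record =~ /^.{24}(.{1,30})/;

print "substr: '$by_substr'\n";
print "regex:  '$by_regex'\n";
```

Both variables end up holding “Perl programmer”, even though fewer than 30 characters remain after index 24.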

An interesting feature of the substr() function is that it can be on the left-hand side of an assignment. For instance, this code:

substr($record, 24, 30) = 'Technical manager';

would overwrite the substring of $record starting at position 24 with length 30 (John’s job, “Perl programmer”) with the string “Technical manager”. This results in $record being modified to be

'John A.     Smith       Technical manager'

Is this a promotion or a demotion?
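
Perl’s substr() also accepts a fourth argument, the replacement string, which performs the same overwrite and returns the text that was replaced. The following sketch compares the two forms; the record layout and variable names are assumptions chosen for illustration:

```perl
#!/usr/bin/perl -w
# substr_replace.pl (hypothetical file name)

use strict;

# assumed layout: two 12-column name fields, then the job title at index 24
my $record = sprintf('%-12s%-12s%s', 'John A.', 'Smith', 'Perl programmer');
my $copy   = $record;

# form 1: substr() on the left-hand side of an assignment
substr($record, 24, 30) = 'Technical manager';

# form 2: four-argument substr() returns what was there before
my $old = substr($copy, 24, 30, 'Technical manager');

print "record:  $record\n";
print "copy:    $copy\n";
print "old job: $old\n";
```

Which form to use is largely a matter of taste; the four-argument form is handy when the old value is still needed.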

Here is an example of using substr(). It prompts the user for a string, a starting index, and a length and then prints the substring to the user. It then overwrites the first five characters of the string the user enters with the string “hello, world!” and prints the result:

#!/usr/bin/perl -w
# substr.pl

use strict;

print "Enter a string:       ";
chomp(my $string = <STDIN>);

print "Enter starting index: ";
chomp(my $index = <STDIN>);

print "Enter length:         ";
chomp(my $length = <STDIN>);

my $s = substr($string, $index, $length);

print "result: $s\n";

# now, overwrite $string
substr($string, 0, 5) = 'hello, world!';

print "string is now: $string\n";

Here is an example of executing this code:

$ perl substr.pl
Enter a string:       practical extraction and report language
Enter starting index: 10
Enter length:         8
result: extracti
string is now: hello, world!ical extraction and report language
$

Transliteration

Now let’s look at another text processing operator—the transliteration operator. Its syntax is

tr/old/new/

This operator resembles the substitution operator, s///, that we saw in Chapter 7 when we discussed regular expressions. Despite the resemblance, the only thing the two have in common is that both operate on $_ by default. The tr/// operator has nothing to do with regular expressions.

What this operator does is to correlate the characters in its two arguments, one by one, and use these pairings to substitute individual characters in the referenced string. The code tr/one/two/ replaces all instances of “o” in the referenced string with “t”, all instances of “n” with “w”, and all instances of “e” with “o”.

This operator translates the characters in $_ by default. To translate a string other than $_ , use the =~ operator as in

$string =~ tr/old/new/;
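
Here is a brief sketch of both forms using the tr/one/two/ pairing described above. The sample strings and variable names are made up for illustration:

```perl
#!/usr/bin/perl -w
# tr_basic.pl (hypothetical file name)

use strict;

# with no =~ binding, tr/// works on $_
$_ = 'none';
tr/one/two/;              # n -> w, o -> t, e -> o
my $default = $_;

# with =~, tr/// works on the named variable instead
my $string = 'one';
$string =~ tr/one/two/;

print "default: $default\n";    # prints "default: wtwo"
print "string:  $string\n";     # prints "string:  two"
```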

Let’s say you wanted to replace, for some reason, all the numbers in a string with letters. You might say something like this:

$string =~ tr/0123456789/abcdefghij/;

This would turn, say, “2011064” into “cabbage”. You can use ranges in transliteration, but not any of the character classes. We could write the preceding as

$string =~ tr/0-9/a-j/;
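
As a quick check that the two spellings behave the same way, this sketch runs both on the same input; the variable names are illustrative:

```perl
#!/usr/bin/perl -w
# tr_range.pl (hypothetical file name)

use strict;

my $explicit = '2011064';
my $ranged   = '2011064';

$explicit =~ tr/0123456789/abcdefghij/;   # every character listed out
$ranged   =~ tr/0-9/a-j/;                 # the same mapping written as ranges

print "explicit: $explicit\n";    # prints "explicit: cabbage"
print "ranged:   $ranged\n";      # prints "ranged:   cabbage"
```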

The return value of this operator is, by default, the number of characters transliterated. You can therefore use the transliteration operator to count the number of occurrences of certain characters. For example, to count the number of vowels in a string, you can use

my $vowels = $string =~ tr/aeiou//;

Note that this will not actually change any of the vowels in the variable $string —as the second group is blank, it is exactly the same as the first group. However, the transliteration operator can take the /d modifier, which will delete occurrences on the left that do not have a correlating character on the right. So, to get rid of all spaces in a string quickly, you could use this line:

$string =~ tr/ //d;
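
The counting and deleting behaviors can be seen side by side in the following sketch. The sample string and variable names are assumptions for illustration, and the deletion is done on a copy so that the original stays available for comparison:

```perl
#!/usr/bin/perl -w
# tr_count.pl (hypothetical file name)

use strict;

my $string = 'the dark side of the moon';

# empty right-hand side: count the vowels, change nothing
my $vowels = $string =~ tr/aeiou//;

# copy the string, then delete the spaces from the copy
(my $squeezed = $string) =~ tr/ //d;

print "vowels:   $vowels\n";      # prints "vowels:   8"
print "original: $string\n";
print "squeezed: $squeezed\n";    # prints "squeezed: thedarksideofthemoon"
```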

Here is an example program that loops through the diamond operator, reading line by line through either the file or files on the command line or standard input. For each line, the tr/// operator is used to uppercase the lowercase letters in $_ :  

#!/usr/bin/perl -w
# tr.pl

while (<>) {
    tr/a-z/A-Z/;
    print;
}

Here is an example of executing this program. We invoke it with no command line arguments, so it reads through standard input until end of file (^D in Unix, ^Z<enter> in Windows):

$ perl tr.pl
And
AND
she’s
SHE’S
buying
BUYING
a
A
stairway
STAIRWAY
^D
$

Summary

In this chapter we have discussed some very useful functions and operators to help us process text files. We determined the length of a string with length(). We worked with string indices and substrings with the functions index() , rindex() , and substr() . Finally, we looked at the transliteration operator, tr/// , which translates characters in a string.

Exercises

  1. Open ex1.dat in read mode. Each line of the file is a string with customer information. The information in the line is based on these character positions:

     

    Positions   Field
    1–24        Customer name
    25–52       Address
    53–72       City
    73–74       State
    76–80       Zip code

    Print the information for each line so that it resembles

    Record:
    name     : John Q Public
    address  : 23 Main St.
    city     : Des Moines
    state    : IA
    zip      : 50309

  2. Write a program to perform the rot13 encoding algorithm. Rot13 is a simple encoding algorithm with the purpose of making text temporarily unreadable. It is called rot13 because it rotates alpha characters 13 positions in the alphabet. For instance, “a” is the first character of the alphabet and it is rotated 13 positions to the 14th character, “n”. The second character, “b”, is rotated to the 15th character “o” and so on through “m”, the 13th character rotated to “z”, the 26th character. When the 14th character, “n”, is rotated 13 positions, it rotates back around to “a”, “o” to “b”, and so on through “z” to “m”:

    a -> n      A -> N
    b -> o      B -> O
    …         …
    m -> z      M -> Z
    n -> a      N -> A
    o -> b      O -> B
    …         …
    z -> m      Z -> M

This program should read its input with the diamond operator (<>). Execute the program like this:

$ perl ex2.pl ex2.dat

To double-check your work, take the standard output from the program and pipe it back into the standard input of the same program:

$ perl ex2.pl ex2.dat | perl ex2.pl
