So far, we’ve looked at the string object’s sequence operations and type-specific methods. Python also provides a variety of ways for us to code strings, which we’ll explore further later (with special characters represented as backslash escape sequences, for instance):
>>> S = 'A\nB\tC' # \n is end-of-line, \t is tab >>> len(S) # Each stands for just one character 5
>>> ord('\n') # \n is a byte with the binary value 10 in ASCII 10
>>> S = 'A\0B\0C' # \0, the binary zero byte, does not terminate the string >>> len(S) 5
Python allows strings to be enclosed in single or double quote characters (they mean the same thing). It also has a multiline string literal form enclosed in triple quotes (single or double)—when this form is used, all the lines are concatenated together, and end-of-line characters are added where line breaks appear. This is a minor syntactic convenience, but it’s useful for embedding things like HTML and XML code in a Python script:
Python also supports a “raw” string literal that turns off the backslash escape mechanism (they start with the letter r), as well as a Unicode string form that supports internationalization (they begin with the letter u and contain multibyte characters). Technically, Unicode string is a different data type than normal string, but it supports all the same string operations. We’ll meet all these special string forms in later chapters.
Pattern Matching
One point worth noting before we move on is that none of the string object’s methods support pattern-based text processing. Text pattern matching is an advanced tool outside this book’s scope, but readers with backgrounds in other scripting languages may be interested to know that to do pattern matching in Python, we import a module called re. This module has analogous calls for searching, splitting, and replacement, but because we can use patterns to specify substrings, we can be much more general:
>>> import re >>> match = re.match('Hello[ \t]*(.*)world', 'Hello Python world') >>> match.group(1) 'Python'
This example searches for a substring that begins with the word “Hello,” followed by zero or more tabs or spaces, followed by arbitrary characters to be saved as a matched group, terminated by the word “world.” If such as substring is found, portions of the substring matched by parts of the pattern enclosed in parentheses are available as groups. The following pattern, for example, picks out three groups separated by slashes:
>>> match = re.match('/(.*)/(.*)/(.*)', '/usr/home/lumberjack') >>> match.groups() ('usr', 'home', 'lumberjack')
Pattern matching is a fairly advanced text-processing tool by itself, but there is also support in Python for even more advanced language processing, including natural language processing. I’ve already said enough about strings for this tutorial, though, so let’s move on to the next type.