String Manipulation - Regular Expressions
Regular expressions are a very powerful tool in any language. They allow
patterns to be matched against strings. Actions such as replacement can be
performed on the string if the regular expression pattern matches. Python's
module for regular expressions is the re module. Open the Python
interactive interpreter, and let's take a closer look at regular expressions and
the re module:
>>> import re
Let's create a simple string we can use to play around with:
>>> test = 'This is for testing regular expressions in Python.'
I spoke of matching special patterns with regular expressions, but let's
start with matching a simple string just to get used to regular expressions.
There are two methods for matching patterns in strings in the re module:
search and match. Let's take a look at search first. It
works like this:
>>> result = re.search ( 'This', test )
We can extract the results using the group method:
>>> result.group ( 0 )
'This'
You're probably wondering about the group method right now and why we
pass zero to it. It's simple, and I'll explain. You see, patterns can be
organized into groups, like this:
>>> result = re.search ( '(Th)(is)', test )
There are two groups, each surrounded by parentheses. We can extract them using
the group method:
>>> result.group ( 1 )
'Th'
>>> result.group ( 2 )
'is'
Passing zero to the method returns the entire match, which here spans both groups:
>>> result.group ( 0 )
'This'
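The group numbering above can be sketched as a small script. This is only an illustration, written for Python 3; the variable names are my own, and the groups method (not used elsewhere in this article) is part of the standard match object.

```python
import re

test = 'This is for testing regular expressions in Python.'

# Two capturing groups: the first holds 'Th', the second holds 'is'.
result = re.search(r'(Th)(is)', test)

whole = result.group(0)        # the entire match
first = result.group(1)        # text captured by the first group
second = result.group(2)       # text captured by the second group
all_groups = result.groups()   # every captured group, as a tuple

print(whole, first, second, all_groups)
```

Group zero is always the whole match, and the numbered groups count opening parentheses from left to right.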
The benefit of groups will become more clear once we work our way into actual
patterns. First, though, let's take a look at the match function. It works
similarly, but there is a crucial difference:
>>> result = re.match ( 'This', test )
>>> print ( result )
<_sre.SRE_Match object at 0x00994250>
>>> print ( result.group ( 0 ) )
This
>>> result = re.match ( 'regular', test )
>>> print ( result )
None
Notice that None was returned, even though “regular” is in the string.
If you haven't figured it out, the match method matches patterns at the
beginning of the string, and the search function examines the whole
string. You might be wondering if it's possible, then, to make the match
method match “regular,” since it's not at the beginning of the string. The
answer is yes. It's possible to match it, and that brings us into patterns.
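Before moving on, the difference between the two functions can be put side by side. This is a minimal sketch for Python 3; the helper name found is hypothetical and exists only to keep the example short.

```python
import re

test = 'This is for testing regular expressions in Python.'


def found(match_object):
    """Return the matched text, or None when nothing matched (hypothetical helper)."""
    return match_object.group(0) if match_object else None


at_start = found(re.match('regular', test))     # match only tries the start of the string
anywhere = found(re.search('regular', test))    # search scans the whole string
position = re.search('regular', test).start()   # index where the match begins

print(at_start, anywhere, position)
```

Here match gives None while search finds the word, and start reports that it begins after the twenty characters that precede it.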
The character “.” will match any character except a newline. We can get the
match method to match “regular” by putting a period for every character before
it. Let's split this up into two groups as well. One will contain the periods,
and one will contain “regular”:
>>> result = re.match ( '(....................)(regular)', test )
>>> result.group ( 0 )
'This is for testing regular'
>>> result.group ( 1 )
'This is for testing '
>>> result.group ( 2 )
'regular'
Aha! We matched it! However, it's ridiculous to have to type in all those
periods. The good news is that we don't have to do that. Take a look at this and
remember that there are twenty characters before “regular”:
>>> result = re.match ( '(.{20})(regular)', test )
>>> result.group ( 0 )
'This is for testing regular'
>>> result.group ( 1 )
'This is for testing '
>>> result.group ( 2 )
'regular'
That's a lot easier. Now let's look at a few more patterns. Here's how you
can use braces in a more advanced way:
>>> result = re.match ( '(.{10,20})(regular)', test )
>>> result.group ( 0 )
'This is for testing regular'
>>> result = re.match ( '(.{10,20})(testing)', test )
>>> result.group ( 0 )
'This is for testing'
By entering two arguments, so to speak, you can match any number of
characters in a range. In this case, that range is 10-20. Sometimes, however,
this can cause undesired behavior. Take a look at this string:
>>> anotherTest = 'a cat, a dog, a goat, a person'
Let's match a range of characters:
>>> result = re.match ( '(.{5,20})(,)', anotherTest )
>>> result.group ( 1 )
'a cat, a dog, a goat'
What if we only want “a cat” though? This can be done by appending “?” after
the closing brace:
>>> result = re.match ( '(.{5,20}?)(,)', anotherTest )
>>> result.group ( 1 )
'a cat'
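The two behaviors can be compared directly. The following is a small sketch for Python 3 that reuses the string from above; the names greedy and lazy are mine, chosen because those are the usual terms for the two kinds of matching.

```python
import re

another_test = 'a cat, a dog, a goat, a person'

# Greedy: take as many characters as the range allows, then find a comma.
greedy = re.match(r'(.{5,20})(,)', another_test).group(1)

# Non-greedy: take as few characters as the range allows, then find a comma.
lazy = re.match(r'(.{5,20}?)(,)', another_test).group(1)

print(repr(greedy))
print(repr(lazy))
```

Both patterns succeed; they differ only in how much of the string the first group claims before the comma.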
Appending a question mark to a quantifier makes it match as few characters as
possible. A question mark that does that, though, is not to be confused with
this pattern:
>>> anotherTest = '012345'
>>> result = re.match ( '01?', anotherTest )
>>> result.group ( 0 )
'01'
>>> result = re.match ( '0123456?', anotherTest )
>>> result.group ( 0 )
'012345'
As you can see with the example, the character before a question mark is
optional. Next is the “*” pattern. It matches zero or more of whatever
it follows, like this:
>>> anotherTest = 'Just a silly string.'
>>> result = re.match ( '(.*)(a)(.*)(string)', anotherTest )
>>> result.group ( 0 )
'Just a silly string'
However, take a look at this:
>>> anotherTest = 'Just a silly string. A very silly string.'
>>> result = re.match ( '(.*)(a)(.*)(string)', anotherTest )
>>> result.group ( 0 )
'Just a silly string. A very silly string'
What if, however, we want to only match the first sentence? If you've been
following along closely, you'll know that “?” will, again, do the trick:
>>> result = re.match ( '(.*?)(a)(.*?)(string)', anotherTest )
>>> result.group ( 0 )
'Just a silly string'
Because “*” matches zero or more characters, it doesn't have to match anything
at all:
>>> anotherTest = '0101'
>>> result = re.match ( '(.*?)(01)', anotherTest )
>>> result.group ( 0 )
'01'
What if we want to skip past the first two characters? This is possible by
using “+”, which is similar to “*”, except that it matches at least one
character:
>>> result = re.match ( '(.+?)(01)', anotherTest )
>>> result.group ( 0 )
'0101'
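To see what each quantifier actually consumed, it helps to look at the first group on its own. This is a sketch for Python 3 that assumes a sample string of '0101', which I picked for illustration because it is consistent with the results shown above.

```python
import re

sample = '0101'  # assumed sample string, chosen for illustration

star = re.match(r'(.*?)(01)', sample)   # "*" may match nothing at all
plus = re.match(r'(.+?)(01)', sample)   # "+" must match at least one character

star_prefix = star.group(1)   # what the non-greedy ".*" consumed
plus_prefix = plus.group(1)   # what the non-greedy ".+" consumed

print(repr(star_prefix), repr(star.group(0)))
print(repr(plus_prefix), repr(plus.group(0)))
```

With “*” the first group is empty, so the match stops at the first “01”. With “+” the first group has to take something, and it keeps growing until another “01” follows it.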
We can also match a range of characters. For example, we can match only the
first four letters of the alphabet:
>>> anotherTest = 'a101'
>>> result = re.match ( '[a-d]', anotherTest )
>>> print ( result )
<_sre.SRE_Match object at 0x00B47B10>
>>> anotherTest = 'q101'
>>> result = re.match ( '[a-d]', anotherTest )
>>> print ( result )
None
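Since a failed match gives back None, a character range is easy to wrap in a yes-or-no test. Here is a minimal sketch for Python 3; the function name starts_with_a_to_d is hypothetical.

```python
import re


def starts_with_a_to_d(text):
    """Return True when the first character of text is a, b, c, or d."""
    return re.match(r'[a-d]', text) is not None


checks = [starts_with_a_to_d(value) for value in ('a101', 'd101', 'q101')]
print(checks)
```

Only the strings that begin with one of the four letters in the range pass the check.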
We can also match one of a few patterns using “|”:
>>> testA = 'a'
>>> testB = 'b'
>>> result = re.match ( '(a|b)', testA )
>>> print ( result )
<_sre.SRE_Match object at 0x00B46D60>
>>> result = re.match ( '(a|b)', testB )
>>> print ( result )
<_sre.SRE_Match object at 0x00B46E60>
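The alternatives don't have to be single characters. As a sketch for Python 3, with words I made up for the example, whole words can be offered as choices:

```python
import re

animal_pattern = r'(cat|dog)'   # hypothetical pattern: either whole word may match

words = ['cat', 'dog', 'goat']
matched = [re.match(animal_pattern, word) is not None for word in words]

print(matched)
```

The first two words match one of the alternatives, while the third matches neither.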
Finally, there are a number of special sequences. “\A” matches at the start
of a string. “\Z” matches at the end of a string. “\d” matches a digit. “\D”
matches anything but a digit. “\s” matches whitespace. “\S” matches anything but
whitespace.
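Here is a short sketch of those sequences at work, written for Python 3. The sample line is made up for the example, and raw strings (the r prefix) are used so that Python leaves the backslashes alone.

```python
import re

line = 'Order 66 shipped'   # made-up sample text

digits = re.search(r'\d+', line).group(0)        # first run of digits
non_digits = re.match(r'\D+', line).group(0)     # leading run of non-digits
first_word = re.match(r'\A\S+', line).group(0)   # non-whitespace run at the start
last_word = re.search(r'\S+\Z', line).group(0)   # non-whitespace run at the end
gap = re.search(r'\s', line).group(0)            # first whitespace character

print(repr(digits), repr(non_digits), repr(first_word), repr(last_word), repr(gap))
```

Note that the run of non-digits includes the space before the number, because a space is not a digit either.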
We can name our groups:
>>> nameTest = 'hot sauce'
>>> result = re.match ( '(?P<one>hot)', nameTest )
>>> result.group ( 'one' )
'hot'
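Names become more useful once there is more than one group. The sketch below, for Python 3, uses group names of my own choosing and the groupdict method of the match object, which collects every named group into a dictionary.

```python
import re

name_test = 'hot sauce'

# Two named groups separated by a single whitespace character.
result = re.match(r'(?P<adjective>\S+)\s(?P<noun>\S+)', name_test)

adjective = result.group('adjective')   # look a group up by name
same_group = result.group(1)            # named groups still have numbers
named = result.groupdict()              # all named groups as a dictionary

print(adjective, same_group, named)
```

A named group can still be reached by its number, so naming a group never takes anything away.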
We can compile patterns to use them multiple times with the re module,
too:
>>> ourPattern = re.compile ( '(.*?)(the)' )
>>> testString = 'This is the dog and the cat.'
>>> result = ourPattern.match ( testString )
>>> result.group ( 0 )
'This is the'
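Compiling pays off when the same pattern is applied to many strings. This is a minimal sketch for Python 3; the list of sentences is invented for the example.

```python
import re

the_pattern = re.compile(r'(.*?)(the)')   # compiled once, reused below

sentences = [
    'This is the dog and the cat.',
    'Over the hill.',
    'No article here.',
]

results = []
for sentence in sentences:
    found = the_pattern.match(sentence)
    # Keep the matched text, or None when the pattern does not match.
    results.append(found.group(0) if found else None)

print(results)
```

The compiled object offers the same match and search methods as the module, just without the pattern argument.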
Of course, you can do more than match and extract substrings. You can replace
things, too:
>>> someString = 'I have a dream.'
>>> re.sub ( 'dream', 'dog', someString )
'I have a dog.'
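By default, sub replaces every occurrence it finds. As a sketch for Python 3, with a longer sentence invented for the example, the optional count argument limits how many replacements are made:

```python
import re

some_string = 'I have a dream, and the dream is big.'   # invented sample sentence

all_replaced = re.sub('dream', 'dog', some_string)            # every occurrence
first_only = re.sub('dream', 'dog', some_string, count=1)     # only the first one

print(all_replaced)
print(first_only)
```

The original string is left untouched in both cases, since sub returns a new string.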
On a final note, you should not use regular expressions to match or replace
simple strings. Plain string methods such as find and replace are simpler and
usually faster for that.
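For fixed text, the built-in string operations are enough. Here is a short sketch for Python 3 that does the same job as the replacement example without the re module:

```python
some_string = 'I have a dream.'

has_dream = 'dream' in some_string               # simple containment test
position = some_string.find('dream')             # index of the first occurrence, or -1
replaced = some_string.replace('dream', 'dog')   # replace every occurrence
begins = some_string.startswith('I have')        # check the beginning of the string

print(has_dream, position, replaced, begins)
```

None of these need a pattern at all, which makes them easier to read when the text being searched for never changes.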
Conclusion
Now you have a basic knowledge of string manipulation in Python behind you.
As I explained at the very beginning of the article, string manipulation is
necessary to many applications, both large and small. It is used frequently, and
a basic knowledge of it is critical.