Pipelines Can Do Amazing Things - Extracting Data from Structured Text Files, continued
Since the password file is publicly readable, any data derived from it is public as well, so there is no real need to restrict access to our program’s intermediate files. However, because all of us at times have to deal with sensitive data, it is good to develop the programming habit of allowing file access only to those users or processes that need it. We therefore reset the umask (see “Default permissions” in Appendix B) as the first action in our program:
umask 077    # Restrict temporary file access to just us
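As a quick check of what this setting does, the following sketch creates a scratch file under umask 077 and prints its permission string. The filename /tmp/pd.umask-demo.$$ is made up for this illustration and is not part of the program.

```shell
#!/bin/sh
# Sketch: show the permissions a new file receives under umask 077.
umask 077
demo=/tmp/pd.umask-demo.$$            # hypothetical scratch file, not used by the real script
: > "$demo"                           # create an empty file
perms=$(ls -l "$demo" | cut -c1-10)   # keep just the mode string
echo "$perms"                         # prints -rw-------
rm -f "$demo"
```

Only the owner's read and write bits are set; group and other users get no access.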
For accountability and debugging, it is helpful to have some commonality in temporary filenames, and to avoid cluttering the current directory with them: we name them with the prefix /tmp/pd.. To guard against name collisions if multiple instances of our program are running at the same time, we also need the names to be unique: the process number, available in the shell variable $$, provides a distinguishing suffix. (This use of $$ is described in more detail in Chapter 10.) We therefore define these shell variables to represent our temporary files:
PERSON=/tmp/pd.key.person.$$    # Unique temporary filenames
OFFICE=/tmp/pd.key.office.$$
TELEPHONE=/tmp/pd.key.telephone.$$
USER=/tmp/pd.key.user.$$
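To see why the $$ suffix keeps concurrent runs apart, here is a small sketch that builds the same filename in the current shell and in a separate child shell. It only prints the names; no files are created.

```shell
#!/bin/sh
# Sketch: two shell processes see different values of $$.
mine=/tmp/pd.key.person.$$
theirs=$(sh -c 'echo /tmp/pd.key.person.$$')   # built by a separate shell process
echo "this shell:  $mine"
echo "child shell: $theirs"
```

The two names differ in their numeric suffix, because each process has its own process number.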
When the job terminates, either normally or abnormally, we want the temporary files to be deleted, so we use the trap command:
trap "exit 1" HUP INT PIPE QUIT TERM
trap "rm -f $PERSON $OFFICE $TELEPHONE $USER" EXIT
During development, we can just comment out the second trap, preserving temporary files for subsequent examination. (The trap command is described in “Trapping Process Signals” [13.3.2]. For now, it’s enough to understand that when the script exits, the trap command arranges to automatically run rm with the given arguments.)
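The cleanup behavior can be observed with a short sketch. It runs a small child script that sets an EXIT trap, then checks afterward whether the scratch file is gone. The filename /tmp/pd.trap-demo.$$ is invented for the demonstration.

```shell
#!/bin/sh
# Sketch: an EXIT trap removes a scratch file when a script finishes.
tmpfile=/tmp/pd.trap-demo.$$        # hypothetical name for this demo

# Run a short child script so that we can look at the file after it exits.
sh -c '
    trap "rm -f $1" EXIT            # $1 is expanded now, when the trap is set
    echo "scratch data" > "$1"
    [ -f "$1" ] && echo "during: file exists"
    exit 0
' sh "$tmpfile"

if [ -f "$tmpfile" ]; then status="still present"; else status=removed; fi
echo "after:  file $status"
```

Because the trap command is written in double quotes, the filename variables are expanded when the trap is set, not when it fires. That is why the filenames must be defined before the trap line.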
We need fields one and five repeatedly, and once we have them, we don’t require the input stream from standard input again, so we begin by extracting them into a temporary file:
awk -F: '{ print $1 ":" $5 }' > $USER    # This reads standard input
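A minimal sketch of this extraction on a single record follows. The sample line is invented for illustration, in the usual seven-field password-file layout; it writes to a variable instead of a temporary file.

```shell
#!/bin/sh
# Sketch: extract the username and the fifth field from one passwd-style line.
# The sample record is made up for illustration.
sample='jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh'
pair=$(printf '%s\n' "$sample" | awk -F: '{ print $1 ":" $5 }')
echo "$pair"    # prints jones:Adrian W. Jones/OSD211/555-0123
```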
We make the key:person pair file first, with a two-step sed program followed by a simple line sort; the sort command is discussed in detail in “Sorting Text” [4.1].
sed -e 's=/.*==' \
-e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' <$USER | sort >$PERSON
The script uses = as the separator character for sed’s s command, since both slashes and colons appear in the data. The first edit strips everything from the first slash to the end of the line, reducing a line like this:
jones:Adrian W. Jones/OSD211/555-0123    Input line
to this:
jones:Adrian W. Jones    Result of first edit
The second edit is more complex, matching three subpatterns in the record. The first part, ^\([^:]*\), matches the username field (e.g., jones). The second part, \(.*\)❒, matches text up to the last space (e.g., Adrian❒W.❒; the ❒ stands for a space character), because .* matches as much as it can. The last part, \([^❒]*\), matches the remaining nonspace text in the record (e.g., Jones). The replacement text reorders the matches, producing something like Jones,❒Adrian W. The result of this single sed command is the desired reordering:
jones:Jones, Adrian W.    Printed result of second edit
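The two edits can be tried on their own with the sketch below. The helper name reorder and the second sample record are assumptions made for this illustration; the sed expressions are the ones from the script.

```shell
#!/bin/sh
# Sketch: apply the two sed edits to sample lines.
reorder() {    # hypothetical helper wrapping the script's sed command
    sed -e 's=/.*==' \
        -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2='
}

person=$(echo 'jones:Adrian W. Jones/OSD211/555-0123' | reorder)
echo "$person"    # prints jones:Jones, Adrian W.

# A name with no space gives the second edit nothing to match,
# so the line passes through after the first edit only.
single=$(echo 'madonna:Madonna/OSD000/555-0000' | reorder)   # made-up record
echo "$single"    # prints madonna:Madonna
```

The second case shows a limitation worth knowing about: single-word names are left in their original form.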
Next, we make the key:office pair file:
sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' < $USER | sort > $OFFICE
The result is a list of users and offices:
jones:OSD211
The key:telephone pair file creation is similar: we just need to adjust the match pattern:
sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2=' < $USER | sort > $TELEPHONE
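Both patterns can be checked against the sample record used earlier. This sketch applies each sed expression to one line and prints the resulting key:value pairs; it skips the sort and the temporary files.

```shell
#!/bin/sh
# Sketch: pull the office and the telephone number out of one record.
line='jones:Adrian W. Jones/OSD211/555-0123'
office=$(echo "$line" | sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=')
phone=$(echo "$line" | sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2=')
echo "$office"    # prints jones:OSD211
echo "$phone"     # prints jones:555-0123
```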
At this stage, we have three separate files, each of which is sorted. Each file consists of the key (the username), a colon, and the particular data (personal name, office, telephone number). The $PERSON file’s contents look like this:
ben:Franklin, Ben
betsy:Ross, Betsy
...
The $OFFICE file has username and office data:
ben:OSD212
betsy:BMD17
...
The $TELEPHONE file records usernames and telephone numbers:
ben:555-0022
betsy:555-0033
...
The three files can now be merged on their common key with the join command. By default, join outputs the common key, then the remaining fields of the line from the first file, followed by the remaining fields of the line from the second file. The common key defaults to the first field, but that can be changed by a command-line option: we don’t need that feature here. Normally, spaces separate fields for join, but we can change the separator with its -t option: we use it as -t:.
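Here is a minimal sketch of join's default behavior on two tiny files, using the sample records shown above. The scratch directory name is invented for the demonstration.

```shell
#!/bin/sh
# Sketch: join two small key:value files on their first field.
dir=/tmp/pd.join-demo.$$            # hypothetical scratch directory
mkdir "$dir" || exit 1
printf '%s\n' 'ben:Franklin, Ben' 'betsy:Ross, Betsy' > "$dir/person"
printf '%s\n' 'ben:OSD212'        'betsy:BMD17'       > "$dir/office"

joined=$(join -t: "$dir/person" "$dir/office")
echo "$joined"
# prints:
#   ben:Franklin, Ben:OSD212
#   betsy:Ross, Betsy:BMD17
rm -rf "$dir"
```

Both inputs must already be sorted on the key, which is why each key file was piped through sort when it was created.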
The join operations are done with a five-stage pipeline, as follows:
- Combine the personal information and the office location:
join -t: $PERSON $OFFICE | ...
The results of this operation, which become the input to the next stage, look like this:
ben:Franklin, Ben:OSD212
betsy:Ross, Betsy:BMD17
...
- Add the telephone number:
... | join -t: - $TELEPHONE | ...
The results of this operation, which become the input to the next stage, look like this:
ben:Franklin, Ben:OSD212:555-0022
betsy:Ross, Betsy:BMD17:555-0033
...
- Remove the key (which is the first field), since it’s no longer needed. This is most easily done with cut and a range that says “use fields two through the end,” like so:
... | cut -d: -f 2- | ...
The results of this operation, which become the input to the next stage, look like this:
Franklin, Ben:OSD212:555-0022
Ross, Betsy:BMD17:555-0033
...
- Re-sort the data. The data was previously sorted by login name, but now things need to be sorted by personal last name. This is done with sort:
... | sort -t: -k1,1 -k2,2 -k3,3 | ...
This command uses a colon to separate fields, sorting on fields 1, 2, and 3, in order. The results of this operation, which become the input to the next stage, look like this:
Franklin, Ben:OSD212:555-0022
Gale, Dorothy:KNS321:555-0044
...
- Finally, reformat the output, using awk’s printf statement to separate each field with tab characters. The command to do this is:
... | awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'
For flexibility and ease of maintenance, formatting should always be left until the end. Up to that point, everything is just text strings of arbitrary length.
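The two middle stages can be tried in isolation. The sketch below feeds three already-joined records through cut and sort; the username dorothy is an assumption, since the text does not give it.

```shell
#!/bin/sh
# Sketch: drop the key field, then re-sort by personal name.
sorted=$(printf '%s\n' \
    'ben:Franklin, Ben:OSD212:555-0022' \
    'betsy:Ross, Betsy:BMD17:555-0033' \
    'dorothy:Gale, Dorothy:KNS321:555-0044' |   # "dorothy" is a made-up username
  cut -d: -f 2- |
  sort -t: -k1,1 -k2,2 -k3,3)
echo "$sorted"
# prints:
#   Franklin, Ben:OSD212:555-0022
#   Gale, Dorothy:KNS321:555-0044
#   Ross, Betsy:BMD17:555-0033
```

The input order (by username) puts Ross before Gale; after the re-sort the records are ordered by last name.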
Here’s the complete pipeline:
join -t: $PERSON $OFFICE |
join -t: - $TELEPHONE |
cut -d: -f 2- |
sort -t: -k1,1 -k2,2 -k3,3 |
awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'
The awk printf statement used here is similar enough to the shell printf command that its meaning should be clear: print the first colon-separated field left-adjusted in a 39-character field, followed by a tab, the second field, another tab, and the third field. Here are the full results:
Franklin, Ben                          •OSD212•555-0022
Gale, Dorothy                          •KNS321•555-0044
Gale, Toto                             •KNS322•555-0045
Hancock, John                          •SIG435•555-0099
Jefferson, Thomas                      •BMD19•555-0095
Jones, Adrian W.                       •OSD211•555-0123
Ross, Betsy                            •BMD17•555-0033
Washington, George                     •BST999•555-0001
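Since padding and tabs are hard to see on a page, the layout can be checked mechanically. This sketch formats one record with the same printf statement, then uses a second awk command (an addition for this check only) to report the width of the first tab-separated field and the number of fields.

```shell
#!/bin/sh
# Sketch: verify the shape of one formatted output line.
formatted=$(echo 'Franklin, Ben:OSD212:555-0022' |
    awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }')

# Report the width of the first tab-separated field and the field count.
shape=$(printf '%s\n' "$formatted" | awk -F'\t' '{ print length($1), NF }')
echo "$shape"    # prints 39 3
```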
That is all there is to it! Our entire script is slightly more than 20 lines long, excluding comments, with five main processing steps. We collect it together in one place in Example 5-1.
Example 5-1. Creating an office directory
#! /bin/sh
# Filter an input stream formatted like /etc/passwd,
# and output an office directory derived from that data.
#
# Usage:
# passwd-to-directory < /etc/passwd > office-directory-file
# ypcat passwd | passwd-to-directory > office-directory-file
# niscat passwd.org_dir | passwd-to-directory > office-directory-file
umask 077
PERSON=/tmp/pd.key.person.$$
OFFICE=/tmp/pd.key.office.$$
TELEPHONE=/tmp/pd.key.telephone.$$
USER=/tmp/pd.key.user.$$
trap "exit 1" HUP INT PIPE QUIT TERM
trap "rm -f $PERSON $OFFICE $TELEPHONE $USER" EXIT
awk -F: '{ print $1 ":" $5 }' > $USER
sed -e 's=/.*==' \
-e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' < $USER | sort > $PERSON
sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' < $USER | sort > $OFFICE
sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2=' < $USER | sort > $TELEPHONE
join -t: $PERSON $OFFICE |
join -t: - $TELEPHONE |
cut -d: -f 2- |
sort -t: -k1,1 -k2,2 -k3,3 |
awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'
The real power of shell scripting shows itself when we want to modify the script to do a slightly different job, such as insertion of the job title from a separately maintained key:jobtitle file. All that we need to do is modify the final pipeline to look something like this:
join -t: $PERSON /etc/passwd.job-title |    # Extra join with job title
join -t: - $OFFICE |
join -t: - $TELEPHONE |
cut -d: -f 2- |
sort -t: -k1,1 -k3,3 -k4,4 |               # Modify sort command
awk -F: '{ printf("%-39s\t%-23s\t%s\t%s\n",
        $1, $2, $3, $4) }'                  # And formatting command
The total cost for the extra directory field is one more join, a change in the sort fields, and a small tweak in the final awk formatting command.
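The extended pipeline can be exercised on a few records with the sketch below. The scratch directory, the job titles, and the use of "|" in place of padding and tabs in the final awk command are all assumptions made so that the output is easy to read; the join, cut, and sort stages are as shown above.

```shell
#!/bin/sh
# Sketch: the extended pipeline run on tiny made-up key files.
dir=/tmp/pd.jobtitle-demo.$$        # hypothetical scratch directory
mkdir "$dir" || exit 1
printf '%s\n' 'ben:Franklin, Ben' 'betsy:Ross, Betsy' > "$dir/person"
printf '%s\n' 'ben:Printer'       'betsy:Seamstress'  > "$dir/jobtitle"   # invented titles
printf '%s\n' 'ben:OSD212'        'betsy:BMD17'       > "$dir/office"
printf '%s\n' 'ben:555-0022'      'betsy:555-0033'    > "$dir/telephone"

result=$(join -t: "$dir/person" "$dir/jobtitle" |
    join -t: - "$dir/office" |
    join -t: - "$dir/telephone" |
    cut -d: -f 2- |
    sort -t: -k1,1 -k3,3 -k4,4 |
    awk -F: '{ printf("%s|%s|%s|%s\n", $1, $2, $3, $4) }')   # "|" only for readability
echo "$result"
# prints:
#   Franklin, Ben|Printer|OSD212|555-0022
#   Ross, Betsy|Seamstress|BMD17|555-0033
rm -rf "$dir"
```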
Because we were careful to preserve special field delimiters in our output, we can trivially prepare useful alternative directories like this:
passwd-to-directory < /etc/passwd | sort -t'•' -k2,2 > dir.by-office
passwd-to-directory < /etc/passwd | sort -t'•' -k3,3 > dir.by-telephone
As usual, • represents an ASCII tab character.
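When typing such a command, the tab has to reach sort as a real character. One portable way, sketched below, is to capture it in a variable with printf; the variable name tab and the use of LC_ALL=C (to make the ordering predictable) are choices made for this example.

```shell
#!/bin/sh
# Sketch: sort tab-separated directory lines by the office field.
tab=$(printf '\t')                  # a literal tab character
by_office=$(printf '%s\t%s\t%s\n' \
    'Franklin, Ben' OSD212 555-0022 \
    'Ross, Betsy'   BMD17  555-0033 \
    'Gale, Dorothy' KNS321 555-0044 |
  LC_ALL=C sort -t"$tab" -k2,2)

# Show just the names, now in office order.
names=$(printf '%s\n' "$by_office" | cut -f1)
echo "$names"
# prints:
#   Ross, Betsy
#   Gale, Dorothy
#   Franklin, Ben
```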
A critical assumption of our program is that there is a unique key for each data record. With that unique key, separate views of the data can be maintained in files as key:value pairs. Here, the key was a Unix username, but in larger contexts, it could be a book number (ISBN), credit card number, employee number, national retirement system number, part number, student number, and so on. Now you know why we get so many numbers assigned to us! You can also see that those handles need not be numbers: they just need to be unique text strings.
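Because everything depends on that assumption, it can be worth testing before the joins are run. The following sketch lists any key that appears more than once in key:value input; the function name duplicate_keys and the conflicting sample record are hypothetical.

```shell
#!/bin/sh
# Sketch: list keys that occur more than once in key:value input.
duplicate_keys() {    # hypothetical helper, not part of the original script
    cut -d: -f1 | sort | uniq -d
}

dups=$(printf '%s\n' 'ben:OSD212' 'betsy:BMD17' 'ben:OSD999' | duplicate_keys)
echo "duplicate keys: ${dups:-none}"    # prints duplicate keys: ben
```

An empty result means every key is unique and the key:value files can be joined safely.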