When you tackle a text processing problem in Unix, it is important to keep the Unix tool philosophy in mind: ask yourself how the problem can be broken down into simpler jobs, for each of which there is already an existing tool, or for which you can readily supply one with a few lines of a shell program or with a scripting language.

5.1 Extracting Data from Structured Text Files

Most administrative files in Unix are simple flat text files that you can edit, print, and read without any special file-specific tools. Many of them reside in the standard directory, /etc. Common examples are the password and group files (passwd and group), the filesystem mount table (fstab or vfstab), the hosts file (hosts), the default shell startup file (profile), and the system startup and shutdown shell scripts, stored in the subdirectory trees rc0.d, rc1.d, and so on, through rc6.d. (There may be other directories as well.)

File formats are traditionally documented in Section 5 of the Unix manual, so the command man 5 passwd provides information about the structure of /etc/passwd.

Despite its name, the password file must always be publicly readable. Perhaps it should have been called the user file because it contains basic information about every user account on the system, packed together in one line per account, with fields separated by colons. We described the file’s format in “Text File Conventions” [3.3.1]. Here are some typical entries:

    jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh

To review, the seven fields of a password-file entry are:

1. The username
2. The encrypted password, or a placeholder when the password is stored elsewhere
3. The numeric user ID
4. The numeric group ID
5. The user's personal name, possibly with other local information
6. The home directory
7. The login shell

All but one of these fields have significance to various Unix programs. The one that does not is the fifth, which conventionally holds user information that is relevant only to local humans. Historically, it was called the gecos field, because it was added in the 1970s at Bell Labs when Unix systems needed to communicate with other computers running the General Electric Comprehensive Operating System, and some extra information about the Unix user was required for that system. Today, most sites use it just to record the personal name, so we simply call it the name field.

For the purposes of this example, we assume that the local site records extra information in the name field: a building and office number identifier (OSD211 in the first sample entry), and a telephone number (555-0123), separated from the personal name by slashes.

One obvious useful thing that we can do with such a file is to write some software to create an office directory. That way, only a single file, /etc/passwd, needs to be kept up-to-date, and derived files can be created when the master file is changed, or more sensibly, by a cron job that runs at suitable intervals. (We will discuss cron in “crontab: Rerun at Specified Times” [13.6.4].)

For our first attempt, we make the office directory a simple text file, with entries like this:

    Franklin, Ben •OSD212•555-0022

where • represents an ASCII tab character. We put the personal name in conventional directory order (family name first), padding the name field with spaces to a convenient fixed length. We prefix the office number and telephone with tab characters to preserve some useful structure that other tools can exploit.

Scripting languages, such as awk, were designed to make such tasks easy because they provide automated input processing and splitting of input records into fields, so we could write the conversion job entirely in such a language. However, we want to show how to achieve the same thing with other Unix tools.
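
As a point of comparison, the all-awk approach mentioned above might look something like the following sketch. It assumes that every name field has the form personal name/office/telephone, as in the sample entry, and that the family name is the last word of the personal name. The function name passwd_to_directory_awk and the 25-column name width are our own illustrative choices, not part of any standard tool.

```shell
#!/bin/sh
# Sketch only: build office-directory lines entirely in awk.
# Assumption: field 5 looks like "Given [Middle] Family/Office/Phone".
# passwd_to_directory_awk is a hypothetical name chosen for this example.
passwd_to_directory_awk() {
    awk -F: '
    {
        n = split($5, part, "/")          # part[1]=name, [2]=office, [3]=phone
        if (n < 3) next                   # skip entries lacking the extra data
        m = split(part[1], word, " ")     # words of the personal name
        if (m > 1) {
            given = word[1]
            for (i = 2; i < m; i++) given = given " " word[i]
            name = word[m] ", " given     # family name first
        } else {
            name = part[1]                # single-word name: leave as is
        }
        # Name padded to 25 columns (arbitrary), then tab-prefixed fields
        printf "%-25s\t%s\t%s\n", name, part[2], part[3]
    }' | LC_ALL=C sort
}

# Demonstration on the sample entry shown earlier
printf '%s\n' \
    'jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh' |
    passwd_to_directory_awk
```

Because the function reads standard input and writes standard output, it could equally be fed the real password file, although entries whose name field lacks the two slashes are silently skipped in this sketch.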
For each password file line, we need to extract field five, split it into three subfields, rearrange the names in the first subfield, and then write an office directory line to a sorting process. awk and cut are convenient tools for field extraction:

    ... | awk -F: '{ print $5 }' | ...

There is a slight complication in that we have two field-processing tasks that we want to keep separate for simplicity, but we need to combine their output to make a directory entry. The join command is just what we need: it expects two input files, each ordered by a common unique key value, and joins lines sharing a common key into a single output line, with user control over which fields are output.

Since our directory entries contain three fields, to use join we need to create three intermediate files containing the colon-separated pairs key:person, key:office, and key:telephone, one pair per line. These can all be temporary files, since they are derived automatically from the password file.

What key do we use? It just needs to be unique, so it could be the record number in the original password file, but in this case it can also be the username, since we know that usernames are unique in the password file and they make more sense to humans than numbers do. Later, if we decide to augment our directory with additional information, such as job title, we can create another nontemporary file with the pair key:jobtitle and add it to the processing stages.

Instead of hardcoding input and output filenames into our program, it is more flexible to write the program as a filter so that it reads standard input and writes standard output. For commands that are used infrequently, it is advisable to give them descriptive, rather than short and cryptic, names, so we start our shell program like this:

    #! /bin/sh
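
Independently of the program begun above, the join step can be sketched in isolation. The following is a minimal illustration under stated assumptions, not the finished program: the function name make_directory_lines and the temporary file names are hypothetical, mktemp is assumed to be available (it is common but not required by POSIX), and the subfields are split with awk purely for brevity. Sorting and joining are both run in the C locale so that the two commands agree on key order.

```shell
#!/bin/sh
# Sketch only: derive key:person, key:office, and key:telephone files from
# password-file lines on standard input, then recombine them with join.
# make_directory_lines and the file names under $tmp are hypothetical.
make_directory_lines() {
    tmp=$(mktemp -d) || return 1      # assumption: mktemp is installed

    # Keep username and field 5, sorted on the key as join requires,
    # for example "jones:Adrian W. Jones/OSD211/555-0123"
    awk -F: '{ print $1 ":" $5 }' |
        LC_ALL=C sort -t: -k1,1 > "$tmp/user"

    # One key:value file per subfield of the name field
    awk -F'[:/]' '{ print $1 ":" $2 }' "$tmp/user" > "$tmp/person"
    awk -F'[:/]' '{ print $1 ":" $3 }' "$tmp/user" > "$tmp/office"
    awk -F'[:/]' '{ print $1 ":" $4 }' "$tmp/user" > "$tmp/telephone"

    # join pairs lines on the first field; -t: sets the field separator,
    # and "-" makes the second join read the first join's output
    LC_ALL=C join -t: "$tmp/person" "$tmp/office" |
        LC_ALL=C join -t: - "$tmp/telephone"

    rm -rf "$tmp"
}

# Demonstration on the sample entry shown earlier
printf '%s\n' \
    'jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh' |
    make_directory_lines
# prints: jones:Adrian W. Jones:OSD211:555-0123
```

Each intermediate file stays small and single-purpose, which is the point of keying on the username: a further file such as key:jobtitle could be joined in with one more pipeline stage.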