
Pipelines Can Do Amazing Things

In this two-part series, you will learn how to handle text processing jobs in Unix with pipelines. This article is excerpted from chapter 5 of Classic Shell Scripting, written by Arnold Robbins and Nelson H.F. Beebe (O'Reilly; ISBN: 0596005954). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

By: O'Reilly Media
June 26, 2008


In this chapter, we solve several relatively simple text processing jobs. What’s interesting about all the examples here is that they are scripts built from simple pipelines: chains of one command hooked into another. Yet each one accomplishes a significant task.

When you tackle a text processing problem in Unix, it is important to keep the Unix tool philosophy in mind: ask yourself how the problem can be broken down into simpler jobs, for each of which there is already an existing tool, or for which you can readily supply one with a few lines of a shell program or with a scripting language.

5.1   Extracting Data from Structured Text Files

Most administrative files in Unix are simple flat text files that you can edit, print, and read without any special file-specific tools. Many of them reside in the standard directory, /etc. Common examples are the password and group files (passwd and group), the filesystem mount table (fstab or vfstab), the hosts file (hosts), the default shell startup file (profile), and the system startup and shutdown shell scripts, stored in the subdirectory trees rc0.d, rc1.d, and so on, through rc6.d. (There may be other directories as well.)

File formats are traditionally documented in Section 5 of the Unix manual, so the command man 5 passwd provides information about the structure of /etc/passwd.

Despite its name, the password file must always be publicly readable. Perhaps it should have been called the user file because it contains basic information about every user account on the system, packed together in one line per account, with fields separated by colons. We described the file’s format in “Text File Conventions” [3.3.1]. Here are some typical entries:

  jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh
  dorothy:*:123:30:Dorothy Gale/KNS321/555-0044:/home/dorothy:/bin/bash
  toto:*:1027:18:Toto Gale/KNS322/555-0045:/home/toto:/bin/tcsh
  ben:*:301:10:Ben Franklin/OSD212/555-0022:/home/ben:/bin/bash
  jhancock:*:1457:57:John Hancock/SIG435/555-0099:/home/jhancock:/bin/bash
  betsy:*:110:20:Betsy Ross/BMD17/555-0033:/home/betsy:/bin/ksh
  tj:*:60:33:Thomas Jefferson/BMD19/555-0095:/home/tj:/bin/bash
  george:*:692:42:George Washington/BST999/555-0001:/home/george:/bin/tcsh

To review, the seven fields of a password-file entry are:

  1. The username
  2. The encrypted password, or an indicator that the password is stored in a separate file
  3. The numeric user ID
  4. The numeric group ID
  5. The user’s personal name, and possibly other relevant data (office number, telephone number, and so on)
  6. The home directory
  7. The login shell
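
As a small illustration of that layout, the shell itself can split one entry into the seven fields. This is only a sketch: the variable names are our own choice, and the sample line is the first entry shown above.

```shell
#! /bin/sh
# Split one passwd-style line into its seven colon-separated fields.
# The variable names below are illustrative, not part of any standard.
line='jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh'

# IFS=: applies only to this read command; -r keeps backslashes literal.
IFS=: read -r user passwd uid gid name home shell <<EOF
$line
EOF

printf 'user=%s uid=%s gid=%s shell=%s\n' "$user" "$uid" "$gid" "$shell"
printf 'name field=%s\n' "$name"
```

The fifth field comes back intact, slashes and all, because only the colon is treated as a separator.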

All but one of these fields have significance to various Unix programs. The one that does not is the fifth, which conventionally holds user information that is relevant only to local humans. Historically, it was called the gecos field, because it was added in the 1970s at Bell Labs when Unix systems needed to communicate with other computers running the General Electric Comprehensive Operating System, and some extra information about the Unix user was required for that system. Today, most sites use it just to record the personal name, so we simply call it the name field.

For the purposes of this example, we assume that the local site records extra information in the name field: a building and office number identifier (OSD211 in the first sample entry), and a telephone number (555-0123), separated from the personal name by slashes.
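
To make the slash-separated layout concrete, here is a minimal sketch that takes such a name field apart with POSIX parameter expansion. It assumes exactly the three-subfield convention just described; the variable names are hypothetical.

```shell
#! /bin/sh
# Break a name field of the form person/office/telephone into its parts.
# Assumes the site convention described in the text (exactly two slashes).
name='Adrian W. Jones/OSD211/555-0123'

person=${name%%/*}   # everything before the first slash
rest=${name#*/}      # everything after the first slash
office=${rest%%/*}   # first piece of what remains
phone=${rest#*/}     # last piece

printf 'person=%s\noffice=%s\nphone=%s\n' "$person" "$office" "$phone"
```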

One obvious useful thing that we can do with such a file is to write some software to create an office directory. That way, only a single file, /etc/passwd, needs to be kept up-to-date, and derived files can be created when the master file is changed, or more sensibly, by a cron job that runs at suitable intervals. (We will discuss cron in “crontab: Rerun at Specified Times” [13.6.4].)

For our first attempt, we make the office directory a simple text file, with entries like this:

  Franklin, Ben             •OSD212•555-0022
  Gale, Dorothy             •KNS321•555-0044

where • represents an ASCII tab character. We put the personal name in conventional directory order (family name first), padding the name field with spaces to a convenient fixed length. We prefix the office number and telephone with tab characters to preserve some useful structure that other tools can exploit.
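
A line in that format can be produced with printf. The sketch below is illustrative only: format_entry is a hypothetical helper, and the width of 25 is an assumption, since the text says only that the length should be convenient.

```shell
#! /bin/sh
# Hypothetical helper: print one directory line from name, office, telephone.
# %-25s left-justifies the name and pads it with spaces (25 is an assumed
# width); each \t emits a literal tab before the office and the telephone.
format_entry () {
    printf '%-25s\t%s\t%s\n' "$1" "$2" "$3"
}

format_entry 'Franklin, Ben' 'OSD212' '555-0022'
format_entry 'Gale, Dorothy' 'KNS321' '555-0044'
```

Because the separators are real tabs, a later stage can recover the office with cut -f2 without worrying about the spaces inside the name.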

Scripting languages, such as awk, were designed to make such tasks easy because they provide automated input processing and splitting of input records into fields, so we could write the conversion job entirely in such a language. However, we want to show how to achieve the same thing with other Unix tools.
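
For comparison, here is a rough sketch of what an all-awk version might look like. It is not the authors' script: the function name, the field width of 25, and the rule that the last word of the personal name is the family name are all assumptions made for illustration.

```shell
#! /bin/sh
# Hypothetical all-awk conversion: passwd-style lines in, directory lines out.
passwd_to_directory_awk () {
    awk -F: '{
        split($5, part, "/")             # part[1]=name, part[2]=office, part[3]=phone
        n = split(part[1], word, " ")    # words of the personal name
        given = word[1]
        for (i = 2; i < n; i++)          # given names are all words but the last
            given = given " " word[i]
        entry = (n > 1) ? (word[n] ", " given) : part[1]
        printf("%-25s\t%s\t%s\n", entry, part[2], part[3])
    }' | sort
}

printf '%s\n' \
    'dorothy:*:123:30:Dorothy Gale/KNS321/555-0044:/home/dorothy:/bin/bash' \
    'ben:*:301:10:Ben Franklin/OSD212/555-0022:/home/ben:/bin/bash' |
    passwd_to_directory_awk
```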

For each password file line, we need to extract field five, split it into three subfields, rearrange the names in the first subfield, and then write an office directory line to a sorting process.

awk and cut are convenient tools for field extraction:

  ... | awk -F: '{ print $5 }' | ...
  ... | cut -d: -f5 | ...
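
To see the two commands side by side, the following sketch feeds one of the sample entries through each. The variable names are ours; both commands should deliver the same fifth field.

```shell
#! /bin/sh
# Extract field five from a sample entry with awk and with cut.
sample='ben:*:301:10:Ben Franklin/OSD212/555-0022:/home/ben:/bin/bash'

via_awk=$(printf '%s\n' "$sample" | awk -F: '{ print $5 }')
via_cut=$(printf '%s\n' "$sample" | cut -d: -f5)

printf 'awk: %s\ncut: %s\n' "$via_awk" "$via_cut"
```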

There is a slight complication in that we have two field-processing tasks that we want to keep separate for simplicity, but we need to combine their output to make a directory entry. The join command is just what we need: it expects two input files, each ordered by a common unique key value, and joins lines sharing a common key into a single output line, with user control over which fields are output.

Since our directory entries contain three fields, to use join we need to create three intermediate files containing the colon-separated pairs key:person, key:office, and key:telephone, one pair per line. These can all be temporary files, since they are derived automatically from the password file.
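
The idea can be tried out by hand before any script exists. The sketch below builds three tiny key-value files and joins them; the file names and contents are made up for the demonstration, and it assumes the widely available (but not POSIX-standard) mktemp command.

```shell
#! /bin/sh
# Demonstrate joining three key:value files on a common key (the username).
tmp=$(mktemp -d) || exit 1

# Each file must already be sorted on the key for join to work.
printf '%s\n' 'ben:Franklin, Ben' 'dorothy:Gale, Dorothy' > "$tmp/person"
printf '%s\n' 'ben:OSD212'        'dorothy:KNS321'        > "$tmp/office"
printf '%s\n' 'ben:555-0022'      'dorothy:555-0044'      > "$tmp/telephone"

# join takes two inputs at a time, so chain two joins;
# "-" tells the second join to read its first input from standard input.
joined=$(join -t: "$tmp/person" "$tmp/office" | join -t: - "$tmp/telephone")
rm -rf "$tmp"

printf '%s\n' "$joined"
```

The -t: option makes the colon both the input and the output field separator, so each result line has the form key:person:office:telephone.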

What key do we use? It just needs to be unique, so it could be the record number in the original password file, but in this case it can also be the username, since we know that usernames are unique in the password file and they make more sense to humans than numbers do. Later, if we decide to augment our directory with additional information, such as job title, we can create another nontemporary file with the pair key:jobtitle and add it to the processing stages.
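
If there is any doubt about the uniqueness assumption, a short pipeline can check it. In this sketch, duplicate_keys is a hypothetical helper that prints every username occurring more than once in its input; empty output means the usernames are safe to use as keys.

```shell
#! /bin/sh
# Hypothetical check: list usernames that appear more than once.
duplicate_keys () {
    cut -d: -f1 |   # keep only the username field
    sort |          # bring identical names together
    uniq -d         # print one copy of each repeated name
}

# Example with made-up input in which "ben" appears twice.
printf '%s\n' 'ben:x' 'tj:x' 'ben:y' | duplicate_keys
```

On a real system the same check would be run as duplicate_keys < /etc/passwd.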

Instead of hardcoding input and output filenames into our program, it is more flexible to write the program as a filter so that it reads standard input and writes standard output. For commands that are used infrequently, it is advisable to give them descriptive, rather than short and cryptic, names, so we start our shell program like this:

  #! /bin/sh
  # Filter an input stream formatted like /etc/passwd,
  # and output an office directory derived from that data.
  # Usage:
  #       passwd-to-directory < /etc/passwd > office-directory-file
  #       ypcat passwd | passwd-to-directory > office-directory-file
  #       niscat passwd.org_dir | passwd-to-directory > office-directory-file
