
Extracting Data from Structured Text Files, continued

In this two-part series, you will learn how to handle text processing jobs in Unix with pipelines. This article is excerpted from chapter 5 of Classic Shell Scripting, written by Arnold Robbins and Nelson H.F. Beebe (O'Reilly; ISBN: 0596005954). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

TABLE OF CONTENTS:
  1. Pipelines Can Do Amazing Things
  2. Extracting Data from Structured Text Files, continued
  3. 5.2 Structured Data for the Web
  4. Structured Data for the Web, continued
By: O'Reilly Media
June 26, 2008

Since the password file is publicly readable, any data derived from it is public as well, so there is no real need to restrict access to our program’s intermediate files. However, because all of us at times have to deal with sensitive data, it is good to develop the programming habit of allowing file access only to those users or processes that need it. We therefore reset the umask (see “Default permissions” in Appendix B) as the first action in our program:

  umask 077                                Restrict temporary file access to just us
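
With that umask in effect, every file the script creates is readable and writable only by its owner. A quick throwaway experiment confirms this (the test filename is hypothetical, and the user, group, and date in the sample output will differ):

  $ umask 077
  $ touch /tmp/pd.test.$$                  Hypothetical scratch file, not part of the script
  $ ls -l /tmp/pd.test.$$
  -rw-------  1 jones  staff  0 Jun 26 12:00 /tmp/pd.test.12345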

For accountability and debugging, it is helpful to have some commonality in temporary filenames, and to avoid cluttering the current directory with them: we name them with the prefix /tmp/pd. To guard against name collisions if multiple instances of our program are running at the same time, we also need the names to be unique: the process number, available in the shell variable $$, provides a distinguishing suffix. (This use of $$ is described in more detail in Chapter 10.) We therefore define these shell variables to represent our temporary files:

  PERSON=/tmp/pd.key.person.$$                 Unique temporary filenames   
  OFFICE=/tmp/pd.key.office.$$
  TELEPHONE=/tmp/pd.key.telephone.$$
  USER=/tmp/pd.key.user.$$

When the job terminates, either normally or abnormally, we want the temporary files to be deleted, so we use the trap command:

  trap "exit 1"       HUP INT PIPE QUIT TERM
  trap "rm -f $PERSON $OFFICE $TELEPHONE $USER" EXIT

During development, we can just comment out the second trap, preserving temporary files for subsequent examination. (The trap command is described in “Trapping Process Signals” [13.3.2]. For now, it’s enough to understand that when the script exits, the trap command arranges to automatically run rm with the given arguments.)
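
During development, then, the pair of traps might look like this (only the comment marker changes):

  trap "exit 1"       HUP INT PIPE QUIT TERM
  # trap "rm -f $PERSON $OFFICE $TELEPHONE $USER" EXIT   Disabled to leave files for inspection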

We need fields one and five repeatedly, and once we have them, we don’t require the input stream from standard input again, so we begin by extracting them into a temporary file:

  awk -F: '{ print $1 ":" $5 }' > $USER                  This reads standard input

We make the key:person pair file first, with a two-step sed program followed by a simple line sort; the sort command is discussed in detail in “Sorting Text” [4.1].

  sed -e 's=/.*==' \
      -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' <$USER | sort >$PERSON

The script uses = as the separator character for sed’s s command, since both slashes and colons appear in the data. The first edit strips everything from the first slash to the end of the line, reducing a line like this:

  jones:Adrian W. Jones/OSD211/555-0123                  Input line

to this:

  jones:Adrian W. Jones               Result of first edit

The second edit is more complex, matching three subpatterns in the record. The first part, ^\([^:]*\), matches the username field (e.g., jones). The second part, \(.*\)❒, matches text up to the final space (e.g., Adrian❒W.❒; the ❒ stands for a space character). The last part, \([^❒]*\), matches the remaining nonspace text in the record (e.g., Jones). The replacement text reorders the matches, producing something like Jones,❒Adrian❒W. The result of this single sed command is the desired reordering:

  jones:Jones, Adrian W.              Printed result of second edit
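
The second edit can be tried in isolation by piping a sample line through sed at a shell prompt (a quick check, not part of the script):

  $ echo 'jones:Adrian W. Jones' |
        sed -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2='
  jones:Jones, Adrian W.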

Next, we make the key:office pair file:

  sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' < $USER | sort > $OFFICE

The result is a list of users and offices:

  jones:OSD211

The key:telephone pair file creation is similar: we just need to adjust the match pattern:

  sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2=' < $USER | sort > $TELEPHONE
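
As with the earlier edit, either pattern can be sanity-checked against a sample record before being wired into the script (an optional experiment; the sample line is the one from the earlier example):

  $ echo 'jones:Adrian W. Jones/OSD211/555-0123' |
        sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2='
  jones:555-0123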

At this stage, we have three separate files, each of which is sorted. Each file consists of the key (the username), a colon, and the particular data (personal name, office, telephone number). The $PERSON file’s contents look like this:

  ben:Franklin, Ben
  betsy:Ross, Betsy
  ...

The $OFFICE file has username and office data:

  ben:OSD212
  betsy:BMD17
  ...

The $TELEPHONE file records usernames and telephone numbers:

  ben:555-0022
  betsy:555-0033
  ...

By default, join outputs the common key, then the remaining fields of the line from the first file, followed by the remaining fields of the line from the second file. The common key defaults to the first field, but that can be changed by a command-line option; we don't need that feature here. Normally, spaces separate fields for join, but we can change the separator with its -t option: we use it as -t:.
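
The effect is easy to see with two tiny hand-made key:value files (the filenames k1 and k2 are just for illustration):

  $ printf 'ann:Ann A.\nbob:Bob B.\n' > k1   # hypothetical sample data, sorted by key
  $ printf 'ann:RM101\nbob:RM102\n' > k2
  $ join -t: k1 k2
  ann:Ann A.:RM101
  bob:Bob B.:RM102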

The join operations are done with a five-stage pipeline, as follows:

  1. Combine the personal information and the office location:

      join -t: $PERSON $OFFICE | ...

    The results of this operation, which become the input to the next stage, look like this:

      ben:Franklin, Ben:OSD212
      betsy:Ross, Betsy:BMD17
      ...
     
  2. Add the telephone number:

      ... | join -t: - $TELEPHONE | ...

    The results of this operation, which become the input to the next stage, look like this:

      ben:Franklin, Ben:OSD212:555-0022
      betsy:Ross, Betsy:BMD17:555-0033
      ...
     
  3. Remove the key (which is the first field), since it’s no longer needed. This is most easily done with cut and a range that says “use fields two through the end,” like so:

      ... | cut -d: -f 2- | ...

    The results of this operation, which become the input to the next stage, look like this:

      Franklin, Ben:OSD212:555-0022
      Ross, Betsy:BMD17:555-0033
      ...
     
  4. Re-sort the data. The data was previously sorted by login name, but now it needs to be sorted by personal last name. This is done with sort (the key options are exercised in a short experiment after this list):

      ... | sort -t: -k1,1 -k2,2 -k3,3 | ...

    This command uses a colon to separate fields, sorting on fields 1, 2, and 3, in order. The results of this operation, which become the input to the next stage, look like this:

      Franklin, Ben:OSD212:555-0022
      Gale, Dorothy:KNS321:555-0044
      ... 
  5. Finally, reformat the output, using awk’s printf statement to separate each field with tab characters. The command to do this is:

      ... | awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'

    For flexibility and ease of maintenance, formatting should always be left until the end. Up to that point, everything is just text strings of arbitrary length.
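
Here is the step 4 sort invocation exercised on its own, with two out-of-order sample lines (a throwaway experiment, not part of the script):

  $ printf 'Ross, Betsy:BMD17:555-0033\nFranklin, Ben:OSD212:555-0022\n' |
        sort -t: -k1,1 -k2,2 -k3,3
  Franklin, Ben:OSD212:555-0022
  Ross, Betsy:BMD17:555-0033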

Here’s the complete pipeline:

  join -t: $PERSON $OFFICE |
      join -t: - $TELEPHONE |
          cut -d: -f 2- |
              sort -t: -k1,1 -k2,2 -k3,3 |
                  awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'

The awk printf statement used here is similar enough to the shell printf command that its meaning should be clear: print the first colon-separated field left-adjusted in a 39-character field, followed by a tab, the second field, another tab, and the third field. Here are the full results:

  Franklin, Ben             •OSD212•555-0022
  Gale, Dorothy             •KNS321•555-0044
  Gale, Toto                •KNS322•555-0045
  Hancock, John             •SIG435•555-0099
  Jefferson, Thomas         •BMD19•555-0095
  Jones, Adrian W.          •OSD211•555-0123
  Ross, Betsy               •BMD17•555-0033
  Washington, George        •BST999•555-0001
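
Since awk's printf closely mirrors the shell-level printf, the same format can be tried directly at a shell prompt (a one-off check; • again stands for the tab character):

  $ printf "%-39s\t%s\t%s\n" "Franklin, Ben" OSD212 555-0022
  Franklin, Ben                          •OSD212•555-0022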

That is all there is to it! Our entire script is slightly more than 20 lines long, excluding comments, with five main processing steps. We collect it together in one place in Example 5-1.

Example 5-1. Creating an office directory

#! /bin/sh
# Filter an input stream formatted like /etc/passwd,
# and output an office directory derived from that data.
#
# Usage:
#       passwd-to-directory < /etc/passwd > office-directory-file
#       ypcat passwd | passwd-to-directory > office-directory-file
#       niscat passwd.org_dir | passwd-to-directory > office-directory-file

umask 077

PERSON=/tmp/pd.key.person.$$
OFFICE=/tmp/pd.key.office.$$
TELEPHONE=/tmp/pd.key.telephone.$$
USER=/tmp/pd.key.user.$$

trap "exit 1"         HUP INT PIPE QUIT TERM trap "rm -f $PERSON $OFFICE $TELEPHONE $USER" EXIT

awk -F: '{ print $1 ":" $5 }' > $USER

sed -e 's=/.*==' \
    -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' < $USER | sort > $PERSON

sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' < $USER | sort > $OFFICE

sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2=' < $USER | sort > $TELEPHONE

join -t: $PERSON $OFFICE |
    join -t: - $TELEPHONE |
        cut -d: -f 2- |
            sort -t: -k1,1 -k2,2 -k3,3 |
                awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'

The real power of shell scripting shows itself when we want to modify the script to do a slightly different job, such as insertion of the job title from a separately
maintained key:jobtitle file. All that we need to do is modify the final pipeline to look something like this:

  join -t: $PERSON /etc/passwd.job-title |               Extra join with job title
      join -t: - $OFFICE |
          join -t: - $TELEPHONE |
              cut -d: -f 2- |
                  sort -t: -k1,1 -k3,3 -k4,4 |           Modify sort command
                      awk -F: '{ printf("%-39s\t%-23s\t%s\t%s\n",
                                 $1, $2, $3, $4) }'      And formatting command

The total cost for the extra directory field is one more join, a change in the sort fields, and a small tweak in the final awk formatting command.
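
The separately maintained file follows the same key:value conventions as the rest of our data; its contents might look like this (hypothetical entries, kept sorted by key so that join can consume the file directly):

  ben:Scientist                            Hypothetical sample entries
  betsy:Seamstress
  ...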

Because we were careful to preserve special field delimiters in our output, we can trivially prepare useful alternative directories like this:

  passwd-to-directory < /etc/passwd | sort -t'•' -k2,2 > dir.by-office
  passwd-to-directory < /etc/passwd | sort -t'•' -k3,3 > dir.by-telephone

As usual, • represents an ASCII tab character.

A critical assumption of our program is that there is a unique key for each data record. With that unique key, separate views of the data can be maintained in files as key:value pairs. Here, the key was a Unix username, but in larger contexts, it could be a book number (ISBN), credit card number, employee number, national retirement system number, part number, student number, and so on. Now you know why we get so many numbers assigned to us! You can also see that those handles need not be numbers: they just need to be unique text strings.



 
 