Pipelines Can Do Amazing Things
By: O'Reilly Media
    2008-06-26


    Table of Contents:
  • Pipelines Can Do Amazing Things
  • Extracting Data from Structured Text Files, continued
  • 5.2 Structured Data for the Web
  • Structured Data for the Web, continued



    Pipelines Can Do Amazing Things - Extracting Data from Structured Text Files, continued
    ( Page 2 of 4 )

     

    Since the password file is publicly readable, any data derived from it is public as well, so there is no real need to restrict access to our program’s intermediate files. However, because all of us at times have to deal with sensitive data, it is good to develop the programming habit of allowing file access only to those users or processes that need it. We therefore reset the umask (see “Default permissions” in Appendix B) as the first action in our program:

      umask 077                                Restrict temporary file access to just us
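The effect of that setting is easy to check. The following is a minimal sketch, not part of the original script: the directory name is made up for the demonstration, and it assumes a filesystem that honors Unix permission bits.

```shell
# Sketch: show what umask 077 does to newly created files and directories.
# The directory name is hypothetical, chosen only for this demonstration.
umask 077
demo_dir=${TMPDIR:-/tmp}/umask-demo.$$
mkdir "$demo_dir" || exit 1
: > "$demo_dir/scratch"                      # create an empty file

# The first ten characters of ls -l output are the type and permission bits
perms=$(ls -l "$demo_dir/scratch" | cut -c1-10)
dir_perms=$(ls -ld "$demo_dir" | cut -c1-10)
echo "file: $perms  directory: $dir_perms"

rm -rf "$demo_dir"
```

With the mask in place, neither group nor other users get any access to the new file or directory.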

    For accountability and debugging, it is helpful to have some commonality in temporary filenames, and to avoid cluttering the current directory with them: we name them with the prefix /tmp/pd.. To guard against name collisions if multiple instances of our program are running at the same time, we also need the names to be unique: the process number, available in the shell variable $$, provides a distinguishing suffix. (This use of $$ is described in more detail in Chapter 10.) We therefore define these shell variables to represent our temporary files:

      PERSON=/tmp/pd.key.person.$$                 Unique temporary filenames   
      OFFICE=/tmp/pd.key.office.$$
      TELEPHONE=/tmp/pd.key.telephone.$$
      USER=/tmp/pd.key.user.$$

    When the job terminates, either normally or abnormally, we want the temporary files to be deleted, so we use the trap command:

      trap "exit 1"       HUP INT PIPE QUIT TERM
      trap "rm -f $PERSON $OFFICE $TELEPHONE $USER" EXIT

    During development, we can just comment out the second trap, preserving temporary files for subsequent examination. (The trap command is described in “Trapping Process Signals” [13.3.2]. For now, it’s enough to understand that when the script exits, the trap command arranges to automatically run rm with the given arguments.)
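The cleanup behavior can be seen in isolation with a small sketch. It is an illustration only: the filename is hypothetical, and a subshell stands in for the whole script so that we can look at the file after "the script" has exited. Single quotes are used here so that the filename is expanded when the trap runs rather than when it is set.

```shell
# Sketch: an EXIT trap removes a temporary file when the (sub)shell exits.
# SCRATCH is a made-up name that mirrors the /tmp/pd. prefix used above.
SCRATCH=${TMPDIR:-/tmp}/pd.demo.$$

(
    trap 'exit 1' HUP INT PIPE QUIT TERM     # turn signals into an exit
    trap 'rm -f "$SCRATCH"' EXIT             # cleanup runs on any exit
    echo "some data" > "$SCRATCH"
    [ -f "$SCRATCH" ] && echo "inside: file exists"
    exit 0
)

# The subshell has exited, so its EXIT trap has already run
if [ -f "$SCRATCH" ]; then
    status="still present"
else
    status="removed"
fi
echo "after: file $status"
```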

    We need fields one and five repeatedly, and once we have them, we don’t require the input stream from standard input again, so we begin by extracting them into a temporary file:

      awk -F: '{ print $1 ":" $5 }' > $USER        This reads standard input
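To see what this extraction produces, here is a small sketch that feeds two passwd-style records to the same awk program through a here-document. The records are invented for the illustration (the numeric IDs, home directories, and shells are assumptions); only the layout of the fifth field matters.

```shell
# Sketch: keep only the username (field 1) and the name/office/phone
# field (field 5) from passwd-style input. Sample records are made up.
extracted=$(awk -F: '{ print $1 ":" $5 }' <<'EOF'
jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh
ben:*:301:10:Ben Franklin/OSD212/555-0022:/home/ben:/bin/sh
EOF
)
printf '%s\n' "$extracted"
```

Each output line has the form username:name/office/telephone, which is the input that the sed commands that follow expect.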

    We make the key:person pair file first, with a two-step sed program followed by a simple line sort; the sort command is discussed in detail in “Sorting Text” [4.1].

      sed -e 's=/.*==' \
          -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' <$USER | sort >$PERSON

    The script uses = as the separator character for sed’s s command, since both slashes and colons appear in the data. The first edit strips everything from the first slash to the end of the line, reducing a line like this:

      jones:Adrian W. Jones/OSD211/555-0123        Input line

    to this:

      jones:Adrian W. Jones               Result of first edit

    The second edit is more complex, matching three subpatterns in the record. The first part, ^\([^:]*\), matches the username field (e.g., jones). The second part, \(.*\)❒, matches text up to a space (e.g., Adrian❒W.❒; the ❒ stands for a space character). Because .* is greedy, it extends to the last space in the record. The last part, \([^❒]*\), matches the remaining nonspace text in the record (e.g., Jones). The replacement text reorders the matches, producing something like Jones,❒Adrian W. The result of this single sed command is the desired reordering:

      jones:Jones, Adrian W.              Printed result of second edit
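Both edits can be tried on a pair of sample records without creating any files. This is a sketch that pipes literal input into the same sed program; the records follow the examples used in this section.

```shell
# Sketch: apply the two sed edits to sample username:name/office/phone
# records, then sort, exactly as the $PERSON file is built.
person=$(printf '%s\n' \
    'jones:Adrian W. Jones/OSD211/555-0123' \
    'ben:Ben Franklin/OSD212/555-0022' |
    sed -e 's=/.*==' \
        -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' |
    sort)
printf '%s\n' "$person"
```

Note that the pattern treats whatever follows the last space as the family name, so a name such as "Ludwig van Beethoven" would come out as "Beethoven, Ludwig van".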

    Next, we make the key:office pair file:

      sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' < $USER | sort > $OFFICE

    The result is a list of users and offices:

      jones:OSD211

    The key:telephone pair file creation is similar: we just need to adjust the match pattern:

      sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2=' < $USER | sort > $TELEPHONE
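The two patterns differ only in how many slash-delimited pieces they skip before the captured one. A short sketch, run on a single sample record, makes the difference visible:

```shell
# Sketch: pull the office and the telephone number out of one record.
record='jones:Adrian W. Jones/OSD211/555-0123'

# Skip the name, capture the text between the first and second slash
office=$(printf '%s\n' "$record" |
    sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=')

# Skip the name and the office, capture the text after the second slash
telephone=$(printf '%s\n' "$record" |
    sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2=')

printf '%s\n%s\n' "$office" "$telephone"
```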

    At this stage, we have three separate files, each of which is sorted. Each file consists of the key (the username), a colon, and the particular data (personal name, office, telephone number). The $PERSON file’s contents look like this:

      ben:Franklin, Ben
      betsy:Ross, Betsy
      ...

    The $OFFICE file has username and office data:

      ben:OSD212
      betsy:BMD17
      ...

    The $TELEPHONE file records usernames and telephone numbers:

      ben:555-0022
      betsy:555-0033
      ...

    By default, join outputs the common key, then the remaining fields of the line from the first file, followed by the remaining fields of the line from the second file. The common key defaults to the first field, but that can be changed by a command-line option: we don't need that feature here. Normally, spaces separate fields for join, but we can change the separator with its -t option: we use it as -t:. Both input files must already be sorted on the key, which is why each of our temporary files was passed through sort.
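Here is a minimal sketch of join on two small sorted files. The working directory name is made up, and one extra record (dorothy) is deliberately given no office entry, to show that join by default prints only keys that appear in both files.

```shell
# Sketch: join two sorted key:value files on the first field.
# The directory name is hypothetical; the data follows the examples above.
work=${TMPDIR:-/tmp}/join-demo.$$
mkdir "$work" || exit 1

printf '%s\n' 'ben:Franklin, Ben' 'betsy:Ross, Betsy' \
    'dorothy:Gale, Dorothy' > "$work/person"
printf '%s\n' 'ben:OSD212' 'betsy:BMD17' > "$work/office"

# dorothy has no line in the office file, so she is dropped from the output
joined=$(join -t: "$work/person" "$work/office")
printf '%s\n' "$joined"

rm -rf "$work"
```

In a real directory, a silently dropped record like this one is worth watching for: it usually means that one of the key:value files is incomplete.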

    The join operations are done with a five-stage pipeline, as follows:

    1. Combine the personal information and the office location:

        join -t: $PERSON $OFFICE | ...

      The results of this operation, which become the input to the next stage, look like this:

        ben:Franklin, Ben:OSD212
        betsy:Ross, Betsy:BMD17
        ...
       
    2. Add the telephone number:

        ... | join -t: - $TELEPHONE | ...

      The results of this operation, which become the input to the next stage, look like this:

        ben:Franklin, Ben:OSD212:555-0022
        betsy:Ross, Betsy:BMD17:555-0033
        ...
       
    3. Remove the key (which is the first field), since it’s no longer needed. This is most easily done with cut and a range that says “use fields two through the end,” like so:

        ... | cut -d: -f 2- | ...

      The results of this operation, which become the input to the next stage, look like this:

        Franklin, Ben:OSD212:555-0022
        Ross, Betsy:BMD17:555-0033
        ...
       
    4. Re-sort the data. The data was previously sorted by login name, but now things need to be sorted by personal last name. This is done with sort:

        ... | sort -t: -k1,1 -k2,2 -k3,3 | ...

      This command uses a colon to separate fields, sorting on fields 1, 2, and 3, in order. The results of this operation, which become the input to the next stage, look like this:

        Franklin, Ben:OSD212:555-0022
        Gale, Dorothy:KNS321:555-0044
        ...

    5. Finally, reformat the output, using awk's printf statement to separate each field with tab characters. The command to do this is:

        ... | awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'

      For flexibility and ease of maintenance, formatting should always be left until the end. Up to that point, everything is just text strings of arbitrary length.
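Stages 3 and 4 can be tried on their own. The sketch below starts from three joined records, in login-name order, and shows that dropping the key and re-sorting puts them in order of personal name. The records are sample data in the style of this section.

```shell
# Sketch: drop the login-name key, then sort by name, office, telephone.
sorted=$(printf '%s\n' \
    'ben:Franklin, Ben:OSD212:555-0022' \
    'betsy:Ross, Betsy:BMD17:555-0033' \
    'dorothy:Gale, Dorothy:KNS321:555-0044' |
    cut -d: -f 2- |
    sort -t: -k1,1 -k2,2 -k3,3)
printf '%s\n' "$sorted"
```

The input order (ben, betsy, dorothy) becomes Franklin, Gale, Ross once the key is gone.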

    Here’s the complete pipeline:

      join -t: $PERSON $OFFICE |
          join -t: - $TELEPHONE |
              cut -d: -f 2- |
                  sort -t: -k1,1 -k2,2 -k3,3 |
                      awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'

    The awk printf statement used here is similar enough to the shell printf command that its meaning should be clear: print the first colon-separated field left-adjusted in a 39-character field, followed by a tab, the second field, another tab, and the third field. Here are the full results:

      Franklin, Ben             •OSD212•555-0022
      Gale, Dorothy             •KNS321•555-0044
      Gale, Toto                •KNS322•555-0045
      Hancock, John             •SIG435•555-0099
      Jefferson, Thomas         •BMD19•555-0095
      Jones, Adrian W.          •OSD211•555-0123
      Ross, Betsy               •BMD17•555-0033
      Washington, George        •BST999•555-0001
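The padding applied by %-39s can be checked directly. In this sketch, one record is formatted with the same awk statement, and the first tab-separated field is then measured; the variable names are ours, not part of the script.

```shell
# Sketch: format one record and confirm that the name field is padded
# on the right to 39 characters, with tabs between the fields.
line=$(printf '%s\n' 'Franklin, Ben:OSD212:555-0022' |
    awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }')

name=$(printf '%s\n' "$line" | cut -f1)      # cut splits on tabs by default
office=$(printf '%s\n' "$line" | cut -f2)
echo "name field is ${#name} characters wide; office is $office"
```

Because the tabs survive in the output, later tools can still split the record into fields, even though the name column is now padded for display.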

    That is all there is to it! Our entire script is slightly more than 20 lines long, excluding comments, with five main processing steps. We collect it together in one place in Example 5-1.

    Example 5-1. Creating an office directory

    #! /bin/sh
    # Filter an input stream formatted like /etc/passwd,
    # and output an office directory derived from that data.
    #
    # Usage:
    #       passwd-to-directory < /etc/passwd > office-directory-file
    #       ypcat passwd | passwd-to-directory > office-directory-file
    #       niscat passwd.org_dir | passwd-to-directory > office-directory-file

    umask 077

    PERSON=/tmp/pd.key.person.$$
    OFFICE=/tmp/pd.key.office.$$
    TELEPHONE=/tmp/pd.key.telephone.$$
    USER=/tmp/pd.key.user.$$

    trap "exit 1"         HUP INT PIPE QUIT TERM
    trap "rm -f $PERSON $OFFICE $TELEPHONE $USER" EXIT

    awk -F: '{ print $1 ":" $5 }' > $USER

    sed -e 's=/.*==' \
        -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' < $USER | sort > $PERSON

    sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' < $USER | sort > $OFFICE

    sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)=\1:\2=' < $USER | sort > $TELEPHONE

    join -t: $PERSON $OFFICE |
        join -t: - $TELEPHONE |
            cut -d: -f 2- |
                sort -t: -k1,1 -k2,2 -k3,3 |
                    awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'

    The real power of shell scripting shows itself when we want to modify the script to do a slightly different job, such as insertion of the job title from a separately maintained key:jobtitle file. All that we need to do is modify the final pipeline to look something like this:

    join -t: $PERSON /etc/passwd.job-title |            Extra join with job title
      join -t: - $OFFICE |
        join -t: - $TELEPHONE |
          cut -d: -f 2- |
            sort -t: -k1,1 -k3,3 -k4,4 |                Modify sort command
              awk -F: '{ printf("%-39s\t%-23s\t%s\t%s\n",
                         $1, $2, $3, $4) }'             And formatting command

    The total cost for the extra directory field is one more join, a change in the sort fields, and a small tweak in the final awk formatting command.

    Because we were careful to preserve special field delimiters in our output, we can trivially prepare useful alternative directories like this:

      passwd-to-directory < /etc/passwd | sort -t'•' -k2,2 > dir.by-office
      passwd-to-directory < /etc/passwd | sort -t'•' -k3,3 > dir.by-telephone

    As usual, • represents an ASCII tab character.
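Typing a literal tab inside quotes on the command line can be awkward, so one common approach is to capture a tab in a variable first. The sketch below does that and sorts a few tab-separated directory lines by office; the variable name TAB is our own choice, and the records are sample data.

```shell
# Sketch: sort tab-separated directory lines on the second field (office).
TAB=$(printf '\t')                 # a variable holding one tab character

by_office=$(printf '%s\t%s\t%s\n' \
    'Ross, Betsy'       'BMD17'  '555-0033' \
    'Franklin, Ben'     'OSD212' '555-0022' \
    'Hancock, John'     'SIG435' '555-0099' \
    'Jefferson, Thomas' 'BMD19'  '555-0095' |
    sort -t"$TAB" -k2,2)
printf '%s\n' "$by_office"
```

The printf format is reused for each group of three arguments, so every record gets its own line with real tabs between the fields.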

    A critical assumption of our program is that there is a unique key for each data record. With that unique key, separate views of the data can be maintained in files as key:value pairs. Here, the key was a Unix username, but in larger contexts, it could be a book number (ISBN), credit card number, employee number, national retirement system number, part number, student number, and so on. Now you know why we get so many numbers assigned to us! You can also see that those handles need not be numbers: they just need to be unique text strings.
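Since everything depends on that assumption, it can be worth testing before trusting the output. The following is a sketch of one way to do so, not something the script above does: it lists any key that occurs more than once in a key:value stream. The sample records, including the deliberate duplicate, are invented.

```shell
# Sketch: report keys that are not unique in key:value data.
dups=$(printf '%s\n' \
    'ben:Franklin, Ben' \
    'betsy:Ross, Betsy' \
    'ben:Franklin, Benjamin' |
    cut -d: -f1 |                  # keep only the key
    sort |
    uniq -d)                       # print each repeated key once

if [ -n "$dups" ]; then
    echo "duplicate keys: $dups"
else
    echo "all keys are unique"
fi
```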



     
     