Pipelines Can Do Amazing Things
By: O'Reilly Media
    2008-06-26



    In this two-part series, you will learn how to handle text processing jobs in Unix with pipelines. This article is excerpted from chapter 5 of Classic Shell Scripting, written by Arnold Robbins and Nelson H.F. Beebe (O'Reilly; ISBN: 0596005954). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

    In this chapter, we solve several relatively simple text processing jobs. What’s interesting about all the examples here is that they are scripts built from simple pipelines: chains of one command hooked into another. Yet each one accomplishes a significant task.

    When you tackle a text processing problem in Unix, it is important to keep the Unix tool philosophy in mind: ask yourself how the problem can be broken down into simpler jobs, for each of which there is already an existing tool, or for which you can readily supply one with a few lines of a shell program or with a scripting language.

    5.1   Extracting Data from Structured Text Files

    Most administrative files in Unix are simple flat text files that you can edit, print, and read without any special file-specific tools. Many of them reside in the standard directory, /etc. Common examples are the password and group files (passwd and group), the filesystem mount table (fstab or vfstab), the hosts file (hosts), the default shell startup file (profile), and the system startup and shutdown shell scripts, stored in the subdirectory trees rc0.d, rc1.d, and so on, through rc6.d. (There may be other directories as well.)

    File formats are traditionally documented in Section 5 of the Unix manual, so the command man 5 passwd provides information about the structure of /etc/passwd.

    Despite its name, the password file must always be publicly readable. Perhaps it should have been called the user file because it contains basic information about every user account on the system, packed together in one line per account, with fields separated by colons. We described the file’s format in “Text File Conventions” [3.3.1]. Here are some typical entries:

      jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh
      dorothy:*:123:30:Dorothy Gale/KNS321/555-0044:/home/dorothy:/bin/bash
      toto:*:1027:18:Toto Gale/KNS322/555-0045:/home/toto:/bin/tcsh
      ben:*:301:10:Ben Franklin/OSD212/555-0022:/home/ben:/bin/bash
      jhancock:*:1457:57:John Hancock/SIG435/555-0099:/home/jhancock:/bin/bash
      betsy:*:110:20:Betsy Ross/BMD17/555-0033:/home/betsy:/bin/ksh
      tj:*:60:33:Thomas Jefferson/BMD19/555-0095:/home/tj:/bin/bash
      george:*:692:42:George Washington/BST999/555-0001:/home/george:/bin/tcsh

    To review, the seven fields of a password-file entry are:

    1. The username
    2. The encrypted password, or an indicator that the password is stored in a separate file
    3. The numeric user ID
    4. The numeric group ID
    5. The user’s personal name, and possibly other relevant data (office number, telephone number, and so on)
    6. The home directory
    7. The login shell

    All but one of these fields have significance to various Unix programs. The one that does not is the fifth, which conventionally holds user information that is relevant only to local humans. Historically, it was called the gecos field, because it was added in the 1970s at Bell Labs when Unix systems needed to communicate with other computers running the General Electric Comprehensive Operating System, and some extra information about the Unix user was required for that system. Today, most sites use it just to record the personal name, so we simply call it the name field.

    For the purposes of this example, we assume that the local site records extra information in the name field: a building and office number identifier (OSD211 in the first sample entry), and a telephone number (555-0123), separated from the personal name by slashes.
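    To make the layout concrete, here is a minimal sketch that pulls the name field out of the first sample entry and splits it at the slashes. The variable names are hypothetical, and the sketch assumes every name field follows the person/office/telephone convention described above.

```shell
#!/bin/sh
# Sample entry copied from the listing above.
entry='jones:*:32713:899:Adrian W. Jones/OSD211/555-0123:/home/jones:/bin/ksh'

# Colons separate the seven top-level fields; field 5 is the name field.
name_field=$(printf '%s\n' "$entry" | cut -d: -f5)

# Slashes separate the site-specific subfields inside the name field.
# (Variable names here are illustrative, not taken from the text.)
person=$(printf '%s\n' "$name_field" | cut -d/ -f1)
office=$(printf '%s\n' "$name_field" | cut -d/ -f2)
phone=$(printf '%s\n' "$name_field" | cut -d/ -f3)

printf 'person=%s office=%s phone=%s\n' "$person" "$office" "$phone"
```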

    One obvious useful thing that we can do with such a file is to write some software to create an office directory. That way, only a single file, /etc/passwd, needs to be kept up-to-date, and derived files can be created when the master file is changed, or more sensibly, by a cron job that runs at suitable intervals. (We will discuss cron in “crontab: Rerun at Specified Times” [13.6.4].)

    For our first attempt, we make the office directory a simple text file, with entries like this:

      Franklin, Ben             •OSD212•555-0022
      Gale, Dorothy             •KNS321•555-0044
      ...

    where • represents an ASCII tab character. We put the personal name in conventional directory order (family name first), padding the name field with spaces to a convenient fixed length. We prefix the office number and telephone with tab characters to preserve some useful structure that other tools can exploit.
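    One way to produce a line in this format is sketched below. The helper names directory_order and directory_line are hypothetical, the pad width of 25 is an assumption (the text says only "a convenient fixed length"), and the name swap assumes that the family name is the last space-separated word.

```shell
#!/bin/sh
# Hypothetical helper: turn "Given Family" into "Family, Given".
# Assumes the family name is the last space-separated word on the line.
directory_order() {
    sed -e 's=^\(.*\) \([^ ]*\)$=\2, \1='
}

# Hypothetical helper: print one directory line from three arguments.
# The name is left-justified and space-padded to 25 columns (an assumed
# width), then office and telephone follow, each prefixed by a tab.
directory_line() {
    printf '%-25s\t%s\t%s\n' "$1" "$2" "$3"
}

name=$(printf '%s\n' 'Ben Franklin' | directory_order)
line=$(directory_line "$name" 'OSD212' '555-0022')
printf '%s\n' "$line"
```

    Because the office and telephone are tab-delimited, a later tool can recover them with something as simple as cut -f2 or cut -f3.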

    Scripting languages, such as awk, were designed to make such tasks easy because they provide automated input processing and splitting of input records into fields, so we could write the conversion job entirely in such a language. However, we want to show how to achieve the same thing with other Unix tools.

    For each password file line, we need to extract field five, split it into three subfields, rearrange the names in the first subfield, and then write an office directory line to a sorting process.

    awk and cut are convenient tools for field extraction:

      ... | awk -F: '{ print $5 }' | ...
      ... | cut -d: -f5 | ...

    There is a slight complication in that we have two field-processing tasks that we want to keep separate for simplicity, but we need to combine their output to make a directory entry. The join command is just what we need: it expects two input files, each ordered by a common unique key value, and joins lines sharing a common key into a single output line, with user control over which fields are output.
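    The behavior of join is easiest to see on a tiny input. The sketch below builds two colon-separated files in a scratch directory and joins them on the first field. The file names and contents are made up for illustration, and mktemp -d is assumed to be available (it is common but not required by POSIX).

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, for illustration only.
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

# Each file holds key:value pairs, already sorted by the key.
printf '%s\n' 'ben:Franklin, Ben' 'dorothy:Gale, Dorothy' > "$tmpdir/person"
printf '%s\n' 'ben:OSD212' 'dorothy:KNS321' > "$tmpdir/office"

# -t: makes the colon the field separator for input and output.
# By default join matches on the first field of each file.
joined=$(join -t: "$tmpdir/person" "$tmpdir/office")
printf '%s\n' "$joined"
```

    Each output line carries the key once, followed by the remaining fields from the first file and then those from the second.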

    Since our directory entries contain three fields, to use join we need to create three intermediate files containing the colon-separated pairs key:person, key:office, and key:telephone, one pair per line. These can all be temporary files, since they are derived automatically from the password file.
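    As a rough sketch of how such key:value pairs might be derived, the functions below read password-file lines on standard input and write sorted pairs keyed by username. The function names are hypothetical, the sketch assumes that every name field has the person/office/telephone form, and it leaves the personal name in its stored order rather than rearranging it.

```shell
#!/bin/sh
# Hypothetical helpers. Each reads passwd-format lines on standard input
# and writes sorted key:value pairs, with the username as the key.
# cut keeps only fields 1 and 5; sed then trims the name field.

# key:person -- delete everything from the first slash onward.
key_person() {
    cut -d: -f1,5 | sed -e 's=/.*==' | sort
}

# key:office -- keep the key and the text between the two slashes.
key_office() {
    cut -d: -f1,5 |
        sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' | sort
}

# key:telephone -- keep the key and the text after the second slash.
key_telephone() {
    cut -d: -f1,5 |
        sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)$=\1:\2=' | sort
}
```

    In a script, each function's output would be redirected to its own temporary file, for example key_office < /etc/passwd > office-file, where office-file is a placeholder name.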

    What key do we use? It just needs to be unique, so it could be the record number in the original password file, but in this case it can also be the username, since we know that usernames are unique in the password file and they make more sense to humans than numbers do. Later, if we decide to augment our directory with additional information, such as job title, we can create another nontemporary file with the pair key:jobtitle and add it to the processing stages.

    Instead of hardcoding input and output filenames into our program, it is more flexible to write the program as a filter so that it reads standard input and writes standard output. For commands that are used infrequently, it is advisable to give them descriptive, rather than short and cryptic, names, so we start our shell program like this:

      #! /bin/sh
      # Filter an input stream formatted like /etc/passwd,
      # and output an office directory derived from that data.
      #
      # Usage:
      #       passwd-to-directory < /etc/passwd > office-directory-file
      #       ypcat passwd | passwd-to-directory > office-directory-file
      #       niscat passwd.org_dir | passwd-to-directory > office-directory-file



     
     