More Amazing Things to Do With Pipelines
By: O'Reilly Media
    2008-07-02

    Table of Contents:
  • More Amazing Things to Do With Pipelines
  • 5.4 Word Lists
  • Word Lists, continued
  • 5.5 Tag Lists
  • 5.6 Summary



    More Amazing Things to Do With Pipelines - 5.5 Tag Lists



    Use of the tr command to obtain lists of words, or more generally, to transform one set of characters to another set, as in Example 5-5 in the preceding section, is a handy Unix tool idiom to remember. It leads naturally to a solution of a problem that we had in writing this book: how do we ensure consistent markup through about 50K lines of manuscript files? For example, a command might be marked up with <command>tr</command> when we talk about it in the running text, but elsewhere, we might give an example of something that you type, indicated by the markup <literal>tr</literal>. A third possibility is a manual-page reference in the form <emphasis>tr</emphasis>(1).
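    As a reminder of the idiom, here is a minimal sketch in the spirit of that word-list pipeline. It is not a copy of Example 5-5, and the sample sentence is invented for illustration. It splits running text into one lowercase word per line:

```shell
#!/bin/sh
# Sketch only: the sample sentence is made up for illustration.
sample='The tr command: Transform, squeeze, or delete characters!'

# First tr: -c complements the letter set and -s squeezes repeats, so
# every run of non-letters collapses into a single newline.
# Second tr: fold uppercase letters to lowercase.
words=$(printf '%s\n' "$sample" |
  tr -cs 'A-Za-z' '\n' |
  tr 'A-Z' 'a-z')

printf '%s\n' "$words"
```

    Feeding the result to sort and uniq -c would then give word counts, which is the same shape of pipeline that the tag-list program builds on.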

    The taglist program in Example 5-6 provides a solution. It finds all begin/end tag pairs written on the same line and outputs a sorted list that associates tag use with input files. Additionally, it flags with an arrow cases where the same word is marked up in more than one way. Here is a fragment of its output from just the file for a version of this chapter:

      $ taglist ch05.xml
      ...
        2 cut      command     ch05.xml
        1 cut      emphasis    ch05.xml <----
      ...
        2 uniq     command     ch05.xml
        1 uniq     emphasis    ch05.xml <----
        1 vfstab   filename    ch05.xml
      ...

    The tag listing task is reasonably complex, and would be quite hard to do in most conventional programming languages, even ones with large class libraries, such as C++ and Java, and even if you started with the Knuth or Hanson literate programs for the somewhat similar word-frequency problem. Yet, just nine steps in a Unix pipeline with by-now familiar tools suffice.

    The word-frequency program did not deal with named files: it just assumed a single data stream. That is not a serious limitation because we can easily feed it multiple input files with cat. Here, however, we need a filename, since it does us no good to report a problem without telling where the problem is. The filename is taglist’s single argument, available in the script as $1.

    1. We feed the input file into the pipeline with cat. We could, of course, eliminate this step by redirecting the input of the next stage from $1, but we find in complex pipelines that it is clearer to separate data production from data processing. It also makes it slightly easier to insert yet another stage into the pipeline if the program later evolves.

        cat "$1" | ...
    2. We apply sed to simplify the otherwise-complex markup needed for web URLs:

        ... | sed -e 's#systemitem *role="url"#URL#g' \
          -e 's#/systemitem#/URL#' | ...

      This converts tags such as <systemitem role="url"> and </systemitem> into simpler <URL> and </URL> tags, respectively.
    3. The next stage uses tr to replace spaces and paired delimiters by newlines:

        ... | tr ' (){}[]' '\n\n\n\n\n\n\n' | ...
    4. At this point, the input consists of one “word” per line (or empty lines). Words are either actual text or SGML/XML tags. Using egrep, the next stage selects tag-enclosed words:

        ... | egrep '>[^<>]+</' | ...

      This regular expression matches tag-enclosed words: a right angle bracket, followed by one or more characters that are not angle brackets, followed by a left angle bracket and a slash (the start of the closing tag).
    5. At this point, the input consists of lines with tags. The first awk stage uses angle brackets as field separators, so the input <literal>tr</literal> is split into five fields: an empty field, followed by literal, tr, /literal, and a trailing empty field. Only the second and third fields are used. The filename is passed to awk on the command line, where the -v option sets the awk variable FILE to the filename. That variable is then used in the printf statement, which outputs the word, the tag, and the filename:

        ... | awk -F'[<>]' -v FILE="$1" \
          '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' | ...

    6. The sort stage sorts the lines into word order:

        ... | sort | ...
    7. The uniq command supplies the initial count field. The output is a list of records, where the fields are count, word, tag, file:

        ... | uniq -c | ...
    8. A second sort orders the output by word and tag (the second and third fields):

        ... | sort -k2,2 -k3,3 | ...
    9. The final stage uses a small awk program to filter successive lines, adding a trailing arrow when it sees the same word as on the previous line. This arrow then clearly indicates instances where words have been marked up differently, and thus deserve closer inspection by the authors, the editors, or the book-production staff:

        ... | awk '{
          print ($2 == Last) ? ($0 " <----") : $0
          Last = $2
        }'
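
    To see what the middle and final stages do, it can help to run them on a single line and look at the intermediate results. The following is a minimal sketch, not part of the original program: the sample line and the filename sample.xml are invented, grep -E stands in for egrep (the two are equivalent), and the conditional in the last awk stage is wrapped in an extra pair of parentheses purely as a precaution.

```shell
#!/bin/sh
# Sketch only: the sample line and the name sample.xml are invented.
line='Use <command>cut</command> (or <emphasis>cut</emphasis>) here.'

# Stages 3 and 4: split into one word per line, keep tag-enclosed words.
# grep -E is the POSIX spelling of egrep.
tagged=$(printf '%s\n' "$line" |
  tr ' (){}[]' '\n\n\n\n\n\n\n' |
  grep -E '>[^<>]+</')
printf 'after stage 4:\n%s\n' "$tagged"

# Stage 5: angle brackets are the field separators, so $2 is the tag
# and $3 is the word.
records=$(printf '%s\n' "$tagged" |
  awk -F'[<>]' -v FILE="sample.xml" \
    '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }')

# Stages 6 to 9: count duplicates, order by word and tag, flag repeats.
# The outer parentheses around the conditional are an extra precaution.
flagged=$(printf '%s\n' "$records" |
  sort |
  uniq -c |
  sort -k2,2 -k3,3 |
  awk '{
    print (($2 == Last) ? ($0 " <----") : $0)
    Last = $2
  }')
printf 'after stage 9:\n%s\n' "$flagged"
```

    After stage 4 only the two tagged words remain. After stage 9 there are two records for cut, one per tag, and the second carries the trailing arrow because its word matches the one on the line before it.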

    The full program is provided in Example 5-6.

    Example 5-6. Making an SGML tag list

    #! /bin/sh -
    # Read an HTML/SGML/XML file given on the command
    # line containing markup like <tag>word</tag> and output on
    # standard output a tab-separated list of
    #
    #   count word tag filename
    #
    # sorted by ascending word and tag.
    #
    # Usage:
    #   taglist xml-file

    cat "$1" |
      sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
        tr ' (){}[]' '\n\n\n\n\n\n\n' |
          egrep '>[^<>]+</' |
            awk -F'[<>]' -v FILE="$1" \
              '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
              sort |
                uniq -c |
                  sort -k2,2 -k3,3 |
                    awk '{
                      print ($2 == Last) ? ($0 " <----") : $0
                      Last = $2
                    }'
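
    For a quick trial without a manuscript at hand, the pipeline can be wrapped in a shell function and run against a small invented file. This is a hedged sketch rather than the authors' script: the function name taglist_sketch, the temporary filename, and the sample markup are all assumptions made for the demonstration, and grep -E is used in place of egrep because some recent grep releases warn that egrep is obsolescent.

```shell
#!/bin/sh
# Sketch only: taglist_sketch, the temporary file name, and the sample
# markup are all invented for this demonstration.

taglist_sketch() {
  cat "$1" |
    sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
    tr ' (){}[]' '\n\n\n\n\n\n\n' |
    grep -E '>[^<>]+</' |
    awk -F'[<>]' -v FILE="$1" \
      '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
    sort |
    uniq -c |
    sort -k2,2 -k3,3 |
    awk '{
      print (($2 == Last) ? ($0 " <----") : $0)
      Last = $2
    }'
}

# Build a three-line sample file in which "cut" is marked up two ways.
demo=${TMPDIR:-/tmp}/taglist-demo.$$.xml
cat > "$demo" <<'EOF'
<para>Use <command>cut</command> to select fields.</para>
<para>The <command>cut</command> tool is described in <emphasis>cut</emphasis>(1).</para>
<para>See also <command>uniq</command> and <filename>vfstab</filename>.</para>
EOF

out=$(taglist_sketch "$demo")
rm -f "$demo"
printf '%s\n' "$out"
```

    The output has four records: cut as command (count 2), cut as emphasis (flagged with the arrow), uniq as command, and vfstab as filename.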

    In “Functions” [6.5], we will show how to apply the tag-list operation to multiple files.


    This article is excerpted from chapter five of Classic Shell Scripting, written by Arnold Robbins and Nelson H.F. Beebe (O'Reilly; ISBN: 0596005954).

       

    © 2003-2008 by Developer Shed. All rights reserved.