More Amazing Things to Do With Pipelines
By: O'Reilly Media
    2008-07-02





    5.5 Tag Lists

    Use of the tr command to obtain lists of words, or more generally, to transform one set of characters to another set, as in Example 5-5 in the preceding section, is a handy Unix tool idiom to remember. It leads naturally to a solution of a problem that we had in writing this book: how do we ensure consistent markup through about 50K lines of manuscript files? For example, a command might be marked up with <command>tr</command> when we talk about it in the running text, but elsewhere, we might give an example of something that you type, indicated by the markup <literal>tr</literal>. A third possibility is a manual-page reference in the form <emphasis>tr</emphasis>(1).

    The taglist program in Example 5-6 provides a solution. It finds all begin/end tag pairs written on the same line and outputs a sorted list that associates tag use with input files. Additionally, it flags with an arrow cases where the same word is marked up in more than one way. Here is a fragment of its output from just the file for a version of this chapter:

      $ taglist ch05.xml
      ...
        2 cut      command     ch05.xml
        1 cut      emphasis    ch05.xml <----
      ...
        2 uniq     command     ch05.xml
        1 uniq     emphasis    ch05.xml <----
        1 vfstab   filename    ch05.xml
      ...

    The tag listing task is reasonably complex, and would be quite hard to do in most conventional programming languages, even ones with large class libraries, such as C++ and Java, and even if you started with the Knuth or Hanson literate programs for the somewhat similar word-frequency problem. Yet, just nine steps in a Unix pipeline with by-now familiar tools suffice.

    The word-frequency program did not deal with named files: it just assumed a single data stream. That is not a serious limitation because we can easily feed it multiple input files with cat. Here, however, we need a filename, since it does us no good to report a problem without telling where the problem is. The filename is taglist’s single argument, available in the script as $1.

    1. We feed the input file into the pipeline with cat. We could, of course, eliminate this step by redirecting the input of the next stage from $1, but we find in complex pipelines that it is clearer to separate data production from data processing. It also makes it slightly easier to insert yet another stage into the pipeline if the program later evolves.

        cat "$1" | ...
    2. We apply sed to simplify the otherwise-complex markup needed for web URLs:

        ... | sed -e 's#systemitem *role="url"#URL#g' \
          -e 's#/systemitem#/URL#' | ...

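      This converts tags such as <systemitem role="url"> and </systemitem> into simpler <URL> and </URL> tags, respectively. Without this step, the space inside the opening tag would cause the next stage to split the tag in two.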
    3. The next stage uses tr to replace spaces and paired delimiters by newlines:

        ... | tr ' (){}[]' '\n\n\n\n\n\n\n' | ...
    4. At this point, the input consists of one “word” per line (or empty lines). Words are either actual text or SGML/XML tags. Using egrep, the next stage selects tag-enclosed words:

        ... | egrep '>[^<>]+</' | ...

      This regular expression matches tag-enclosed words: a right angle bracket, followed by at least one character that is not an angle bracket, followed by a left angle bracket, followed by a slash (for the closing tag).
    5. At this point, the input consists of lines with tags. The first awk stage uses angle brackets as field separators, so the input <literal>tr</literal> is split into an empty field, followed by literal, tr, and /literal (plus a trailing empty field after the final bracket). The filename is passed to awk on the command line, where the -v option sets the awk variable FILE to the filename. That variable is then used in the printf statement, which outputs the word, the tag, and the filename:

        ... | awk -F'[<>]' -v FILE="$1" \
          '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' | ...

    6. The sort stage sorts the lines into word order:

        ... | sort | ...
    7. The uniq command supplies the initial count field. The output is a list of records, where the fields are count, word, tag, file:

        ... | uniq -c | ...
    8. A second sort orders the output by word and tag (the second and third fields):

        ... | sort -k2,2 -k3,3 | ...
    9. The final stage uses a small awk program to filter successive lines, adding a trailing arrow when it sees the same word as on the previous line. This arrow then clearly indicates instances where words have been marked up differently, and thus deserve closer inspection by the authors, the editors, or the book-production staff:

        ... | awk '{
          print ($2 == Last) ? ($0 " <----") : $0
          Last = $2
        }'
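
    To make the first half of the pipeline concrete, here is a minimal sketch that runs stages 2 through 5 on a single made-up line. The sample text and the variable names are assumptions for illustration, not taken from the manuscript; grep -E is used here as the equivalent of egrep, and the awk stage prints only the word and the tag.

```shell
# Hypothetical one-line sample; not taken from the book's manuscript files.
sample='See <systemitem role="url">http://example.com/</systemitem> and <command>tr</command>.'

# Stage 2 simplifies the URL markup, stage 3 splits the line into words,
# stage 4 keeps only tag-enclosed words, stage 5 prints "word tag".
words=$(printf '%s\n' "$sample" |
  sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
    tr ' (){}[]' '\n\n\n\n\n\n\n' |
      grep -E '>[^<>]+</' |
        awk -F'[<>]' '{ print $3, $2 }')

printf '%s\n' "$words"
# prints:
#   http://example.com/ URL
#   tr command
```

    Plain words such as See and and are dropped by the grep stage, and the trailing period after </command> does no harm because awk takes only the second and third fields.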

    The full program is provided in Example 5-6.

    Example 5-6. Making an SGML tag list

    #! /bin/sh -
    # Read an HTML/SGML/XML file given on the command
    # line containing markup like <tag>word</tag> and output on
    # standard output a tab-separated list of
    #
    #   count word tag filename
    #
    # sorted by ascending word and tag.
    #
    # Usage:
    #   taglist xml-file

    cat "$1" |
      sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
        tr ' (){}[]' '\n\n\n\n\n\n\n' |
          egrep '>[^<>]+</' |
            awk -F'[<>]' -v FILE="$1" \
              '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
              sort |
                uniq -c |
                  sort -k2,2 -k3,3 |
                    awk '{
                      print ($2 == Last) ? ($0 " <----") : $0
                      Last = $2
                    }'
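
    The counting and flagging half of the program (the last four stages) can also be tried in isolation. The following sketch feeds it a few hand-written word/tag/file records; the records and the variable names are hypothetical, chosen to resemble what the first awk stage might emit.

```shell
# Hypothetical word/tag/file records, shaped like the first awk stage's output.
records=$(printf '%s\t%s\t%s\n' \
  cut  command  ch05.xml \
  uniq command  ch05.xml \
  cut  emphasis ch05.xml \
  cut  command  ch05.xml)

# Sort, count duplicates, reorder by word and tag, then flag any line
# whose word repeats the word on the previous line.
flagged=$(printf '%s\n' "$records" |
  sort |
    uniq -c |
      sort -k2,2 -k3,3 |
        awk '{
          print ($2 == Last) ? ($0 " <----") : $0
          Last = $2
        }')

printf '%s\n' "$flagged"
```

    With this input, the two identical cut/command records collapse into one line with a count of 2, and the cut/emphasis line that follows should carry the arrow, since it repeats the word cut under a different tag. The width of the count column varies between uniq implementations.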

    In “Functions” [6.5], we will show how to apply the tag-list operation to multiple files.
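
    In the meantime, one possible way to handle several files is to run the per-file stages in a loop and pool the records before counting, so that inconsistent markup is flagged across files as well as within them. This is only a sketch under that assumption, not necessarily the approach taken in the later section; the function names tagwords and taglist_all and the sample files are made up, and grep -E stands in for egrep.

```shell
# Hypothetical helper: stages 1-5 of Example 5-6 for a single file.
tagwords() {
  cat "$1" |
    sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
      tr ' (){}[]' '\n\n\n\n\n\n\n' |
        grep -E '>[^<>]+</' |
          awk -F'[<>]' -v FILE="$1" \
            '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }'
}

# Hypothetical wrapper: pool the records from every file, then count and flag.
taglist_all() {
  for f in "$@"
  do
    tagwords "$f"
  done |
    sort |
      uniq -c |
        sort -k2,2 -k3,3 |
          awk '{
            print ($2 == Last) ? ($0 " <----") : $0
            Last = $2
          }'
}

# Two throwaway sample files (names and contents are made up).
tmpdir=$(mktemp -d)
printf '%s\n' 'Run <command>cut</command> first.' > "$tmpdir/a.xml"
printf '%s\n' 'See <emphasis>cut</emphasis>(1).'  > "$tmpdir/b.xml"

result=$(taglist_all "$tmpdir/a.xml" "$tmpdir/b.xml")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

    Here cut appears as a command in one file and as emphasis in the other, so the second record should be flagged with the arrow even though each file is consistent on its own.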



     
     