
5.5 Tag Lists

In this second part of a two-part series on pipelines in Unix, you will learn some fun ways to cheat at word puzzles and other, more useful tricks. This article is excerpted from chapter 5 of Classic Shell Scripting, written by Arnold Robbins and Nelson H.F. Beebe (O'Reilly; ISBN: 0596005954). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

By: O'Reilly Media
July 02, 2008




Use of the tr command to obtain lists of words, or more generally, to transform one set of characters to another set, as in Example 5-5 in the preceding section, is a handy Unix tool idiom to remember. It leads naturally to a solution of a problem that we had in writing this book: how do we ensure consistent markup through about 50K lines of manuscript files? For example, a command might be marked up with <command>tr</command> when we talk about it in the running text, but elsewhere, we might give an example of something that you type, indicated by the markup <literal>tr</literal>. A third possibility is a manual-page reference in the form <emphasis>tr</emphasis>(1).
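As a quick refresher on the idiom, here is a minimal example (the sample string is invented) of tr transforming one character set into another:

```shell
# A small illustration of the tr idiom: map each character in the
# first set to the corresponding character in the second set,
# here converting lowercase letters to uppercase.
printf 'consistent markup\n' | tr 'a-z' 'A-Z'
# prints: CONSISTENT MARKUP
```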

The taglist program in Example 5-6 provides a solution. It finds all begin/end tag pairs written on the same line and outputs a sorted list that associates tag use with input files. Additionally, it flags with an arrow cases where the same word is marked up in more than one way. Here is a fragment of its output from just the file for a version of this chapter:

  $ taglist ch05.xml
    2 cut      command     ch05.xml
    1 cut      emphasis    ch05.xml <----
    2 uniq     command     ch05.xml
    1 uniq     emphasis    ch05.xml <----
    1 vfstab   filename    ch05.xml

The tag listing task is reasonably complex, and would be quite hard to do in most conventional programming languages, even ones with large class libraries, such as C++ and Java, and even if you started with the Knuth or Hanson literate programs for the somewhat similar word-frequency problem. Yet, just nine steps in a Unix pipeline with by-now familiar tools suffice.

The word-frequency program did not deal with named files: it just assumed a single data stream. That is not a serious limitation because we can easily feed it multiple input files with cat. Here, however, we need a filename, since it does us no good to report a problem without telling where the problem is. The filename is taglist’s single argument, available in the script as $1.
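The positional parameter works like this (the two-line script below is invented purely for illustration):

```shell
# Invented illustration of $1, the first command-line argument:
# create a trivial script and pass it a filename.
cat > showfile <<'EOF'
#! /bin/sh -
echo "input file: $1"
EOF
sh showfile ch05.xml
# prints: input file: ch05.xml
```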

  1. We feed the input file into the pipeline with cat. We could, of course, eliminate this step by redirecting the input of the next stage from $1, but we find in complex pipelines that it is clearer to separate data production from data processing. It also makes it slightly easier to insert yet another stage into the pipeline if the program later evolves.

      cat "$1" | ...
  2. We apply sed to simplify the otherwise-complex markup needed for web URLs:

      ... | sed -e 's#systemitem *role="url"#URL#g' \
        -e 's#/systemitem#/URL#' | ...

    This converts tags such as <systemitem role="URL"> and </systemitem> into simpler <URL> and </URL> tags, respectively.
  3. The next stage uses tr to replace spaces and paired delimiters by newlines:

      ... | tr ' (){}[]' '\n\n\n\n\n\n\n' | ...
  4. At this point, the input consists of one “word” per line (or empty lines). Words are either actual text or SGML/XML tags. Using egrep, the next stage selects tag-enclosed words:

      ... | egrep '>[^<>]+</' | ...

    This regular expression matches tag-enclosed words: a right angle bracket, followed by at least one nonangle bracket, followed by a left angle bracket, followed by a slash (for the closing tag). 
  5. At this point, the input consists of lines with tags. The first awk stage uses angle brackets as field separators, so the input <literal>tr</literal> is split into four fields: an empty field, followed by literal, tr, and /literal. The filename is passed to awk on the command line, where the -v option sets the awk variable FILE to the filename. That variable is then used in the print statement, which outputs the word, the tag, and the filename:

      ... | awk -F'[<>]' -v FILE="$1" \
        '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' | ...

  6. The sort stage sorts the lines into word order:

      ... | sort | ...
  7. The uniq command supplies the initial count field. The output is a list of records, where the fields are count, word, tag, file:

      ... | uniq -c | ...
  8. A second sort orders the output by word and tag (the second and third fields):

      ... | sort -k2,2 -k3,3 | ...
  9. The final stage uses a small awk program to filter successive lines, adding a trailing arrow when it sees the same word as on the previous line. This arrow then clearly indicates instances where words have been marked up differently, and thus deserve closer inspection by the authors, the editors, or the book-production staff:

      ... | awk '{
        print ($2 == Last) ? ($0 " <----") : $0
        Last = $2
      }'
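
To see how the middle stages cooperate, here is a hypothetical one-line sample pushed through stages 3 through 5 (the filename field and column padding are omitted for brevity):

```shell
# Hypothetical one-line sample run through the tr/egrep/awk stages:
# split on spaces and paired delimiters, keep only tag-enclosed
# words, then print the word and its tag.
printf 'Use <literal>tr</literal> here (see notes)\n' |
  tr ' (){}[]' '\n\n\n\n\n\n\n' |
    egrep '>[^<>]+</' |
      awk -F'[<>]' '{ printf("%s\t%s\n", $3, $2) }'
# prints "tr" and "literal" separated by a tab
```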

The full program is provided in Example 5-6.

Example 5-6. Making an SGML tag list

#! /bin/sh -
# Read an HTML/SGML/XML file given on the command
# line containing markup like <tag>word</tag> and output on
# standard output a tab-separated list of
#   count word tag filename
# sorted by ascending word and tag.
# Usage:
#   taglist xml-file

cat "$1" |
  sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
    tr ' (){}[]' '\n\n\n\n\n\n\n' |
      egrep '>[^<>]+</' |
        awk -F'[<>]' -v FILE="$1" \
          '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
          sort |
            uniq -c |
              sort -k2,2 -k3,3 |
                awk '{
                  print ($2 == Last) ? ($0 " <----") : $0
                  Last = $2
                }'
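
As a usage sketch (both file names are invented), saving the program as taglist and running it on a small test file shows the arrow marking the inconsistently marked-up word:

```shell
# Hypothetical end-to-end run: Example 5-6 saved as "taglist"
# (invented name), applied to a small invented test file.
cat > taglist <<'SCRIPT'
#! /bin/sh -
cat "$1" |
  sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
    tr ' (){}[]' '\n\n\n\n\n\n\n' |
      egrep '>[^<>]+</' |
        awk -F'[<>]' -v FILE="$1" \
          '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
          sort |
            uniq -c |
              sort -k2,2 -k3,3 |
                awk '{
                  print ($2 == Last) ? ($0 " <----") : $0
                  Last = $2
                }'
SCRIPT

cat > sample.xml <<'EOF'
Use <command>cut</command> to select fields.
Or type <literal>cut</literal> at the prompt.
EOF

sh taglist sample.xml
# Both markups of "cut" sort together, and the second line gets
# the arrow (spacing abridged):
#   1 cut    command    sample.xml
#   1 cut    literal    sample.xml <----
```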

In “Functions” [6.5], we will show how to apply the tag-list operation to multiple files.
