More Amazing Things to Do With Pipelines

In this second part of a two-part series on pipelines in Unix, you will learn some fun ways to cheat at word puzzles and other, more useful tricks. This article is excerpted from chapter 5 of Classic Shell Scripting, written by Arnold Robbins and Nelson H.F. Beebe (O’Reilly; ISBN: 0596005954). Copyright © 2007 O’Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O’Reilly Media.

5.3 Cheating at Word Puzzles

Crossword puzzles give you clues about words, but most of us get stuck when we cannot think of, say, a ten-letter word that begins with a b and has either an x or a z in the seventh position.

Regular-expression pattern matching with awk or grep is clearly called for, but what files do we search? One good choice is the Unix spelling dictionary, available as /usr/dict/words, on many systems. (Other popular locations for this file are /usr/share/dict/words and /usr/share/lib/dict/words.) This is a simple text file, with one word per line, sorted in lexicographic order. We can easily create other similar-appearing files from any collection of text files, like this:

  cat file(s) | tr A-Z a-z | tr -c "a-z'" '\n' | sort -u

The second pipeline stage converts uppercase to lowercase, the third replaces nonletters by newlines, and the last sorts the result, keeping only unique lines. The third stage treats apostrophes as letters, since they are used in contractions. Every Unix system has collections of text that can be mined in this way—for example, the formatted manual pages in /usr/man/cat*/* and /usr/local/man/cat*/*. On one of our systems, they supplied more than 1 million lines of prose and produced a list of about 44,000 unique words. There are also word lists for dozens of languages in various Internet archives.*
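As a quick illustration, here is the same pipeline run on a one-line inline sample (the sentence is invented for the example; we add tr's -s option as a small refinement, so that each run of nonletters produces a single newline rather than several):

```shell
#!/bin/sh
# Build a sorted list of the unique lowercase words in a small
# inline sample; real input would be files such as /usr/man/cat*/*.
printf "Brevity is the soul of wit; it's true.\n" |
  tr A-Z a-z |            # fold uppercase to lowercase
  tr -cs "a-z'" '\n' |    # turn each run of nonletters into one newline
  sort -u                 # sort, keeping one copy of each word
```

The result is eight words, one per line, with the contraction it's kept intact because the apostrophe is in the retained character set.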

Let us assume that we have built up a collection of word lists in this way, and we stored them in a standard place that we can reference from a script. We can then write the program shown in Example 5-4.

Example 5-4. Word puzzle solution helper

#! /bin/sh
# Match an egrep(1)-like pattern against a collection of
# word lists.
# Usage:
#   puzzle-help egrep-pattern [word-list-files]


egrep -h -i "$pattern" $FILES 2> /dev/null | sort -u -f

The FILES variable holds the built-in list of word-list files, customized to the local site. The grep option -h suppresses filenames from the report, the -i option ignores lettercase, and we discard the standard error output with 2> /dev/null, in case any of the word-list files don’t exist or they lack the necessary read permission. (This kind of redirection is described in “File Descriptor Manipulation” [7.3.2].) The final sort stage reduces the report to just a list of unique words, ignoring lettercase.
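The effect of that redirection is easy to observe in isolation. In this sketch (the temporary filename and the nonexistent path are invented), egrep finds the word in the readable file, and the complaint about the missing one is silently discarded:

```shell
#!/bin/sh
# Show how 2> /dev/null suppresses errors about missing or
# unreadable files while matches still reach standard output.
tmp=/tmp/puzzle-demo.$$
echo bamboozle > "$tmp"
egrep -h -i bamboozle "$tmp" /no/such/wordlist 2> /dev/null
rm -f "$tmp"
```

Only the matching line, bamboozle, appears; the error about /no/such/wordlist never reaches the terminal.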

Now we can find the word that we were looking for:

  $ puzzle-help '^b.....[xz]...$' | fmt
  bamboozled Bamboozler bamboozles bdDenizens bdWheezing Belshazzar
  botanizing Brontozoum Bucholzite bulldozing

Can you think of an English word with six consonants in a row? Here’s some help:

  $ puzzle-help '[^aeiouy]{6}' /usr/dict/words

If you don’t count y as a vowel, many more turn up: encryption, klystron, porphyry, syzygy, and so on.

We could readily exclude the contractions from the word lists by a final filter step—egrep -i '^[a-z]+$'—but there is little harm in leaving them in the word lists.
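For reference, here is that filter at work on a few sample entries (the words are illustrative); only the pure alphabetic words survive:

```shell
#!/bin/sh
# Keep only words made entirely of letters; contractions such as
# "it's" and "o'clock" are discarded by the anchored pattern.
printf "it's\nbamboozle\no'clock\nzebra\n" |
  egrep -i '^[a-z]+$'
```

The output is just bamboozle and zebra, each on its own line.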

5.4 Word Lists

From 1983 to 1987, Bell Labs researcher Jon Bentley wrote an interesting column in Communications of the ACM titled Programming Pearls. Some of the columns were later collected, with substantial changes, into two books listed in the Bibliography. In one of the columns, Bentley posed this challenge: write a program to process a text file, and output a list of the n most-frequent words, with counts of their frequency of occurrence, sorted by descending count. Noted computer scientists Donald Knuth and David Hanson responded separately with interesting and clever literate programs,* each of which took several hours to write. Bentley’s original specification was imprecise, so Hanson rephrased it this way: Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest in order of decreasing frequency.

In the first of Bentley’s articles, fellow Bell Labs researcher Doug McIlroy reviewed Knuth’s program, and offered a six-step Unix solution that took only a couple of minutes to develop and worked correctly the first time. Moreover, unlike the two other programs, McIlroy’s is devoid of explicit magic constants that limit the word lengths, the number of unique words, and the input file size. Also, its notion of what constitutes a word is defined entirely by simple patterns given in its first two executable statements, making changes to the word-recognition algorithm easy.

McIlroy’s program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the word-frequency problem, McIlroy converted the text file to a list of words, one per line (tr does the job), mapped words to a single lettercase (tr again), sorted the list (sort), reduced it to a list of unique words with counts (uniq), sorted that list by descending counts (sort), and finally, printed the first several entries in the list (sed, though head would work too).

The resulting program is worth being given a name (wf, for word frequency) and wrapped in a shell script with a comment header. We also extend McIlroy’s original sed command to make the output list-length argument optional, and we modernize the sort options. We show the complete program in Example 5-5.

Example 5-5. Word-frequency filter

#! /bin/sh
# Read a text stream on standard input, and output a list of
# the n (default: 25) most frequently occurring words and
# their frequency counts, in order of descending counts, on
# standard output.
# Usage:
#   wf [n]

tr -cs A-Za-z\' '\n' |          # Replace nonletters with newlines
 tr A-Z a-z |                   # Map uppercase to lowercase
  sort |                        # Sort the words in ascending order
   uniq -c |                    # Eliminate duplicates, showing their counts
    sort -k1,1nr -k2 |          # Sort by descending count, and then by ascending word
     sed ${1:-25}q              # Print only the first n (default: 25) lines; see Chapter 3

POSIX tr supports all of the escape sequences of ISO Standard C. The older X/Open Portability Guide specification only had octal escape sequences, and the original tr had none at all, forcing the newline to be written literally, which was one of the criticisms levied at McIlroy’s original program. Fortunately, the tr command on every system that we tested now has the POSIX escape sequences.
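The equivalence of the newer and older notations is easy to verify: the C-style \n and the octal \012 name the same character, so the two commands in this sketch produce identical output:

```shell
#!/bin/sh
# The POSIX C-style escape \n and the XPG-era octal escape \012
# both denote the newline character.
printf 'one two three' | tr -s ' ' '\n'
printf 'one two three' | tr -s ' ' '\012'
```

Each command prints the three words on separate lines.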

A shell pipeline isn’t the only way to solve this problem with Unix tools: Bentley gave a six-line awk implementation of this program in an earlier column* that is roughly equivalent to the first four stages of McIlroy’s pipeline.

Knuth and Hanson discussed the computational complexity of their programs, and Hanson used runtime profiling to investigate several variants of his program to find the fastest one.

The complexity of McIlroy’s program is easy to identify. All but the sort stages run in a time that is linear in the size of their input, and that size is usually sharply reduced after the uniq stage. Thus, the rate-limiting step is the first sort. A good sorting algorithm based on comparisons, like that in Unix sort, can sort n items in a time proportional to n log2 n. The logarithm-to-the-base-2 factor is small: for n of about 1 million, it is about 20. Thus, in practice, we expect wf to be only a few times slower than cat would be at simply copying its input stream.
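That factor is easy to check with a one-line awk computation (awk supplies only the natural logarithm, so we divide by the logarithm of 2):

```shell
# log2(1,000,000) = ln(1,000,000) / ln(2); just under 20
awk 'BEGIN { printf "%.1f\n", log(1000000) / log(2) }'
```

The result is 19.9, confirming the estimate.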

Here is an example of applying this script to the text of Shakespeare’s most popular play, Hamlet,† reformatting the output with pr to a four-column display:

  $ wf 12 < hamlet | pr -c4 -t -w80
   1148 the      671 of       550 a        451 in
    970 and      635 i        514 my       419 it
    771 to       554 you      494 hamlet   407 that


The results are about as expected for English prose. More interesting, perhaps, is to ask how many unique words there are in the play:

  $ wf 999999 < hamlet | wc -l
  4548

and to look at some of the least-frequent words:

  $ wf 999999 < hamlet | tail -n 12 | pr -c4 -t -w80
      1 yaw      1 yesterday    1 yielding   1 younger
      1 yawn     1 yesternight  1 yon        1 yourselves
      1 yeoman   1 yesty        1 yond       1 zone

There is nothing magic about the argument 999999: it just needs to be a number larger than any expected count of unique words, and the keyboard repeat feature makes it easy to type.
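An alternative that avoids the big argument altogether is to drop wf’s final sed stage and count the unique words directly; here is a sketch of that variant on a tiny inline sample:

```shell
#!/bin/sh
# The first stages of wf, followed by wc -l, count unique words
# without needing a large list-length argument.
printf 'To be or not to be\n' |
  tr -cs A-Za-z '\n' |    # one word per line
  tr A-Z a-z |            # fold lettercase
  sort -u |               # keep one copy of each word
  wc -l                   # count them (4 here: to, be, or, not)
```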

We can also ask how many of the 4548 unique words were used just once:

  $ wf 999999 < hamlet | grep -c '^ *1•'

The • following the digit 1 in the grep pattern represents a tab. This result is surprising, and probably atypical of most modern English prose: although the play’s vocabulary is large, nearly 58 percent of the words occur only once. And yet, the core vocabulary of frequently occurring words is rather small:
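Because a literal tab is invisible in print and awkward to type in some shells, one portable sketch captures it with printf first. The input lines below are invented, mimicking a count-then-tab-then-word format:

```shell
#!/bin/sh
# Store a tab character in a variable, then use it in the grep
# pattern to count entries whose count field is exactly 1.
tab=$(printf '\t')
printf ' 2\tthe\n 1\tyaw\n 1\tzone\n' |
  grep -c "^ *1${tab}"
```

This prints 2, the number of once-only entries in the sample.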

  $ wf 999999 < hamlet | awk '$1 >= 5' | wc -l

This is about the number of words that a student might be expected to learn in a semester course on a foreign language, or that a child learns before entering school.

Shakespeare didn’t have computers to help analyze his writing,* but we can speculate that part of his genius was in making most of what he wrote understandable to the broadest possible audience of his time.

When we applied wf to the individual texts of Shakespeare’s plays, we found that Hamlet has the largest vocabulary (4548), whereas Comedy of Errors has the smallest (2443). The total number of unique words in the Shakespeare corpus of plays and sonnets is nearly 23,700, which shows that you need exposure to several plays to enjoy the richness of his work. About 36 percent of those words are used only once, and only one word begins with x: Xanthippe, in Taming of the Shrew. Clearly, there is plenty of fodder in Shakespeare for word-puzzle enthusiasts and vocabulary analysts!

5.5 Tag Lists

Use of the tr command to obtain lists of words, or more generally, to transform one set of characters to another set, as in Example 5-5 in the preceding section, is a handy Unix tool idiom to remember. It leads naturally to a solution of a problem that we had in writing this book: how do we ensure consistent markup through about 50K lines of manuscript files? For example, a command might be marked up with <command>tr</command> when we talk about it in the running text, but elsewhere, we might give an example of something that you type, indicated by the markup <literal>tr</literal>. A third possibility is a manual-page reference in the form <emphasis>tr</emphasis>(1).

The taglist program in Example 5-6 provides a solution. It finds all begin/end tag pairs written on the same line and outputs a sorted list that associates tag use with input files. Additionally, it flags with an arrow cases where the same word is marked up in more than one way. Here is a fragment of its output from just the file for a version of this chapter:

  $ taglist ch05.xml
    2 cut      command     ch05.xml
    1 cut      emphasis    ch05.xml <----
    2 uniq     command     ch05.xml
    1 uniq     emphasis    ch05.xml <----
    1 vfstab   filename    ch05.xml

The tag listing task is reasonably complex, and would be quite hard to do in most conventional programming languages, even ones with large class libraries, such as C++ and Java, and even if you started with the Knuth or Hanson literate programs for the somewhat similar word-frequency problem. Yet, just nine steps in a Unix pipeline with by-now familiar tools suffice.

The word-frequency program did not deal with named files: it just assumed a single data stream. That is not a serious limitation because we can easily feed it multiple input files with cat. Here, however, we need a filename, since it does us no good to report a problem without telling where the problem is. The filename is taglist’s single argument, available in the script as $1.

  1. We feed the input file into the pipeline with cat. We could, of course, eliminate this step by redirecting the input of the next stage from $1, but we find in complex pipelines that it is clearer to separate data production from data processing. It also makes it slightly easier to insert yet another stage into the pipeline if the program later evolves.

      cat "$1" | …
  2. We apply sed to simplify the otherwise-complex markup needed for web URLs:

      … | sed -e 's#systemitem *role="url"#URL#g' \
              -e 's#/systemitem#/URL#' | …

    This converts tags such as <systemitem role="URL"> and </systemitem> into simpler <URL> and </URL> tags, respectively.
  3. The next stage uses tr to replace spaces and paired delimiters by newlines:

      … | tr ' (){}[]' '\n\n\n\n\n\n\n' | …
  4. At this point, the input consists of one “word” per line (or empty lines). Words are either actual text or SGML/XML tags. Using egrep, the next stage selects tag-enclosed words:

      … | egrep '>[^<>]+</' | …

    This regular expression matches tag-enclosed words: a right angle bracket, followed by at least one nonangle bracket, followed by a left angle bracket, followed by a slash (for the closing tag). 
  5. At this point, the input consists of lines with tags. The first awk stage uses angle brackets as field separators, so the input <literal>tr</literal> is split into four fields: an empty field, followed by literal, tr, and /literal. The filename is passed to awk on the command line, where the -v option sets the awk variable FILE to the filename. That variable is then used in the print statement, which outputs the word, the tag, and the filename:

      … | awk -F'[<>]' -v FILE="$1" \
            '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' | …

  6. The sort stage sorts the lines into word order:

      … | sort | …
  7. The uniq command supplies the initial count field. The output is a list of records, where the fields are count, word, tag, file:

      … | uniq -c | …
  8. A second sort orders the output by word and tag (the second and third fields):

      … | sort -k2,2 -k3,3 | …
  9. The final stage uses a small awk program to filter successive lines, adding a trailing arrow when it sees the same word as on the previous line. This arrow then clearly indicates instances where words have been marked up differently, and thus deserve closer inspection by the authors, the editors, or the book-production staff:

      … | awk '{
            print ($2 == Last) ? ($0 " <----") : $0
            Last = $2
          }'
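This duplicate-marking idiom is worth trying on its own; with a few invented records, the repeated word cut earns an arrow on its second appearance:

```shell
#!/bin/sh
# Append an arrow to any line whose second field matches the
# previous line's second field.
printf '2 cut command\n1 cut emphasis\n2 uniq command\n' |
  awk '{
    print (($2 == Last) ? ($0 " <----") : $0)
    Last = $2
  }'
```

The middle line comes out flagged, just as in the taglist report above.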

The full program is provided in Example 5-6.

Example 5-6. Making an SGML tag list

#! /bin/sh -
# Read an HTML/SGML/XML file given on the command
# line containing markup like <tag>word</tag> and output on
# standard output a tab-separated list of
#   count word tag filename
# sorted by ascending word and tag.
# Usage:
#   taglist xml-file

cat "$1" |
  sed -e 's#systemitem *role="url"#URL#g' -e 's#/systemitem#/URL#' |
    tr ' (){}[]' '\n\n\n\n\n\n\n' |
      egrep '>[^<>]+</' |
        awk -F'[<>]' -v FILE="$1" \
          '{ printf("%-31s\t%-15s\t%s\n", $3, $2, FILE) }' |
          sort |
            uniq -c |
              sort -k2,2 -k3,3 |
                awk '{
                       print ($2 == Last) ? ($0 " <----") : $0
                       Last = $2
                     }'

In “Functions” [6.5], we will show how to apply the tag-list operation to multiple files.
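Until then, a plain loop does the job; in this sketch, taglist is a stub standing in for the script of Example 5-6, and the chapter filenames are invented:

```shell
#!/bin/sh
# Apply a tag-list-style report to several files in turn; replace
# the stub function with the real taglist script in practice.
taglist() {
  printf 'report for %s\n' "$1"
}
for f in ch01.xml ch02.xml ch03.xml
do
  taglist "$f"
done
```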

5.6 Summary

This chapter has shown how to solve several text processing problems, none of which would be simple to do in most programming languages. The critical lessons of this chapter are:

  1. Data markup is extremely valuable, although it need not be complex. A unique single character, such as a tab, colon, or comma, often suffices.
  2. Pipelines of simple Unix tools and short, often inline, programs in a suitable text processing language, such as awk, can exploit data markup to pass multiple pieces of data through a series of processing stages, emerging with a useful report.
  3. By keeping the data markup simple, the output of our tools can readily become input to new tools, as shown by our little analysis of the output of the word-frequency filter, wf, applied to Shakespeare’s texts.
  4. By preserving some minimal markup in the output, we can later come back and massage that data further, as we did to turn a simple ASCII office directory into a web page. Indeed, it is wise never to consider any form of electronic data as final: there is a growing demand in some quarters for page-description languages, such as PCL, PDF, and PostScript, to preserve the original markup that led to the page formatting. Word processor documents currently are almost devoid of useful logical markup, but that may change in the future. At the time of this writing, one prominent word processor vendor was reported to be considering an XML representation for document storage. The GNU Project’s gnumeric spreadsheet, the Linux Documentation Project,* and the† office suite already do that. 
  5. Lines with delimiter-separated fields are a convenient format for exchanging data with more complex software, such as spreadsheets and databases. Although such systems usually offer some sort of report-generation feature, it is often easier to extract the data as a stream of lines of fields, and then to apply filters written in suitable programming languages to manipulate the data further. For example, catalog and directory publishing are often best done this way.








* Available at,,
and A search for “word list” in any Internet search engine turns up many more.

* Programming Pearls: A Literate Program: A WEB program for common words, Comm. ACM 29(6), 471–483, June (1986), and Programming Pearls: Literate Programming: Printing Common Words, 30(7), 594–599, July (1987). Knuth’s paper is also reprinted in his book Literate Programming, Stanford University Center for the Study of Language and Information, 1992, ISBN 0-937073-80-6 (paper) and 0-937073-81-4 (cloth).

* Programming Pearls: Associative Arrays, Comm. ACM 28(6), 570–576, June (1985). This is an excellent introduction to the power of associative arrays (tables indexed by strings, rather than integers), a common feature of most scripting languages.

† Available in the wonderful Project Gutenberg archives at

* Indeed, the only word related to the root of computer that Shakespeare used is computation, just once in each of two plays, Comedy of Errors and King Richard III. “Arithmetic” occurs six times in his plays, “calculate” twice, and “mathematics” thrice.

* See

† See
