BrainDump
  Home arrow BrainDump arrow Page 2 - More Amazing Things to Do With Pipelines
Dev Shed Forums  
Administration  
AJAX  
Apache  
BrainDump  
DHTML  
Flash  
Java  
JavaScript  
Multimedia  
MySQL  
Oracle  
Perl  
PHP  
Practices  
Python  
Reviews  
Security  
Smartphone Development  
Style-Sheets  
Web Services  
XML  
Zend  
Zope  
Mobile Linux  
App Generation ROI  
IBM® developerWorks  
Forums Sitemap  
E-Commerce Hosting  
Linux Web Hosting  
Managed Hosting  
Small Business Hosting  
VPS Hosting  
Weekly Newsletter

 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid  
Request Media Kit
Contact Us  
Site Map  
Privacy Policy  
Support  
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
BRAINDUMP

More Amazing Things to Do With Pipelines
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: starstarstarstarstar / 4
    2008-07-02


    Table of Contents:
  • More Amazing Things to Do With Pipelines
  • 5.4 Word Lists
  • Word Lists, continued
  • 5.5 Tag Lists
  • 5.6 Summary

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      error-file:tidyout.log Del.ici.ous error-file:tidyout.log Digg
      error-file:tidyout.log Blink error-file:tidyout.log Simpy
      error-file:tidyout.log Google error-file:tidyout.log Spurl
      error-file:tidyout.log Y! MyWeb error-file:tidyout.log Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article

     
     
    ADVERTISEMENT


    More Amazing Things to Do With Pipelines - 5.4 Word Lists
    ( Page 2 of 5 )

    From 1983 to 1987, Bell Labs researcher Jon Bentley wrote an interesting column in Communications of the ACM titled Programming Pearls. Some of the columns were later collected, with substantial changes, into two books listed in the Bibliography. In one of the columns, Bentley posed this challenge: write a program to process a text file, and output a list of the n  most-frequent words, with counts of their frequency of occurrence, sorted by descending count. Noted computer scientists Donald Knuth and David Hanson responded separately with interesting and clever literate programs,* each of which took several hours to write. Bentley’s original specification was imprecise, so Hanson rephased it this way: Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n  largest in order of decreasing frequency.

    In the first of Bentley’s articles, fellow Bell Labs researcher Doug McIlroy reviewed Knuth’s program, and offered a six-step Unix solution that took only a couple of minutes to develop and worked correctly the first time. Moreover, unlike the two other programs, McIlroy’s is devoid of explicit magic constants that limit the word lengths, the number of unique words, and the input file size. Also, its notion of what constitutes a word is defined entirely by simple patterns given in its first two executable statements, making changes to the word-recognition algorithm easy.

    McIlroy’s program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the word-frequency problem, McIlroy converted the text file to a list of words, one per line (tr does the job), mapped words to a single lettercase (tr again), sorted the list (sort), reduced it to a list of unique words with counts (uniq), sorted that list by descending counts (sort), and finally, printed the first several entries in the list (sed, though head would work too).

    The resulting program is worth being given a name (wf, for word frequency) and wrapped in a shell script with a comment header. We also extend McIlroy’s original sed command to make the output list-length argument optional, and we modernize the sort options. We show the complete program in Example 5-5.

    Example 5-5. Word-frequency filter

    #! /bin/sh
    # Read a text stream on standard input, and output a list of
    # the n (default: 25) most frequently occurring words and
    # their frequency counts, in order of descending counts, on
    # standard output.
    #
    # Usage:
    #   wf [n]

    tr -cs A-Za-z\' '\n' |  Replace nonletters with newlines
     tr A-Z a-z |   Map uppercase to lowercase
      sort |   Sort the words in ascending order
       uniq -c |   Eliminate duplicates, showing their counts
        sort -k1,1nr -k2 |  Sort by descending count, and then by ascending word
         sed ${1:-25}q   Print only the first n (default: 25) lines; see Chapter 3

    POSIX tr supports all of the escape sequences of ISO Standard C. The older X/Open Portability Guide specification only had octal escape sequences, and the original tr had none at all, forcing the newline to be written literally, which was one of the criticisms levied at McIlroy’s original program. Fortunately, the tr command on every system that we tested now has the POSIX escape sequences.

    A shell pipeline isn’t the only way to solve this problem with Unix tools: Bentley gave a six-line awk implementation of this program in an earlier column* that is roughly equivalent to the first four stages of McIlroy’s pipeline.

    Knuth and Hanson discussed the computational complexity of their programs, and Hanson used runtime profiling to investigate several variants of his program to find the fastest one.

    The complexity of McIlroy’s is easy to identify. All but the sort stages run in a time that is linear in the size of their input, and that size is usually sharply reduced after the uniq stage. Thus, the rate-limiting step is the first sort. A good sorting algorithm based on comparisons, like that in Unix sort, can sort items in a time proportional to n log2n. The logarithm-to-the-base-2 factor is small: for n about 1 million, it is about 20. Thus, in practice, we expect wf to be a few times slower than it would take to just copy its input stream with cat.

    Here is an example of applying this script to the text of Shakespeare’s most popular play, Hamlet,† reformatting the output with pr to a four-column display:

      $ wf 12 < hamlet | pr -c4 -t -w80

     1148 the 671 of 550 a 451 in
    970 and 635 i 514 my 419 it
    771 to 554 you 494 hamlet 407 that



     
     
    >>> More BrainDump Articles          >>> More By O'Reilly Media
     

       

    BRAINDUMP ARTICLES

    - Demystifying SELinux on Kernel 2.6
    - Yahoo and Microsoft Create Ad Partnership
    - The Advantages of Obscure Open Source Browse...
    - Dell Announces CSI-style Digital Forensics S...
    - Milepost GCC Speeds Open-Source Development
    - Learn These 10 Programming Languages
    - Tomcat Capacity Planning
    - Internal and External Performance Tuning wit...
    - Tomcat Benchmark Procedure
    - Benchmarking Tomcat Performance
    - Tomcat Performance Tuning
    - Wubi: Windows-based Ubuntu Installer
    - Configuring and Optimizing Your I/O Scheduler
    - Linux I/O Schedulers
    - Advising the Linux Kernel on File I/O





    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 2 Hosted by Hostway
    Stay green...Green IT