Pipelines Can Do Amazing Things
By: O'Reilly Media
    2008-06-26


    Table of Contents:
  • Pipelines Can Do Amazing Things
  • Extracting Data from Structured Text Files, continued
  • 5.2 Structured Data for the Web
  • Structured Data for the Web, continued


    Pipelines Can Do Amazing Things - Structured Data for the Web, continued
    ( Page 4 of 4 )

     

    Because we chose to preserve special field separators in the text version of the office directory, we have sufficient information to identify the cells in each row. Also, because whitespace is mostly not significant in HTML files (except to humans), we need not be particularly careful about getting tags nicely lined up: if that is needed later, html-pretty can do it perfectly. Our conversion filter then has three steps:

    1. Output the leading boilerplate down to the beginning of the document body.
    2. Wrap each directory row in table markup.
    3. Output the trailing boilerplate.

    We have to make one small change from our minimal example: the DOCTYPE command has to be updated to a later grammar level so that it looks like this:

      <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN//3.0">

    You don’t have to memorize this: html-pretty has options to produce output in any of the standard HTML grammar levels, so you can just copy a suitable DOCTYPE command from its output.

    Clearly, most of the work is just writing boilerplate, but that is simple since we can just copy text from the minimal HTML example. The only programmatic step required is the middle one, which we could do with only a couple of lines in awk. However, we can achieve it with even less work using a sed stream-editor substitution with two edit commands: one to substitute the embedded tab delimiters with </TD><TD>, and a following one to wrap the entire line in <TR><TD>...</TD></TR>. We temporarily assume that no accented characters are required in the directory, but we can easily allow for angle brackets and ampersands in the input stream by adding three initial sed steps. We collect the complete program in Example 5-2.
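    The awk alternative mentioned above is not shown in the text. A minimal sketch of what it might look like follows; the function name tsv_rows_awk is made up for illustration, and the entity escaping is folded into the same awk program.

```shell
# Sketch only: the row-wrapping step written in awk instead of sed.
# tsv_rows_awk is a hypothetical name, not part of the original script.
tsv_rows_awk() {
    awk '{
        gsub(/&/, "\\&amp;")       # escape ampersands before adding entities
        gsub(/</, "\\&lt;")        # escape angle brackets found in the data
        gsub(/>/, "\\&gt;")
        gsub(/\t/, "</TD><TD>")    # each tab becomes a cell boundary
        print "   <TR><TD>" $0 "</TD></TR>"
    }'
}
```

    It would be used as a filter, for example: tsv_rows_awk < infile. The ampersand substitution has to come first, so that the entities produced by the later steps are not escaped a second time.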

    Example 5-2. Converting an office directory to HTML

    #! /bin/sh
    # Convert a tab-separated value file to grammar-conformant HTML.
    #
    # Usage:
    #   tsv-to-html < infile > outfile

    # Leading boilerplate
    cat << EOFILE
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN//3.0">
    <HTML>
      <HEAD>
        <TITLE>
          Office directory
        </TITLE>
        <LINK REV="made" HREF="mailto:$USER@`hostname`">
      </HEAD>
      <BODY>
        <TABLE>
    EOFILE

    # Convert special characters to entities, then supply table markup.
    # (\t for a tab is accepted by GNU sed; other seds may need a literal tab.)
    sed -e 's=&=\&amp;=g' \
        -e 's=<=\&lt;=g' \
        -e 's=>=\&gt;=g' \
        -e 's=\t=</TD><TD>=g' \
        -e 's=^.*$=   <TR><TD>&</TD></TR>='

    # Trailing boilerplate
    cat << EOFILE
        </TABLE>
      </BODY>
    </HTML>
    EOFILE

    The << notation is called a here document. It is explained in more detail in “Additional Redirection Operators” [7.3.1]. Briefly, the shell reads all lines up to the delimiter following the << (EOFILE in this case), does variable and command substitution on the contained lines, and feeds the results as standard input to the command.
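    As a small illustration of that behavior, the sketch below contrasts an unquoted delimiter with a quoted one, which suppresses the substitutions. The variable title and both function names are invented for this example.

```shell
# Sketch only: how the shell treats the lines of a here document.
# "title" and the two function names are illustrative, not from the book.
title="Office directory"

# Unquoted delimiter: variable and command substitution are performed.
expanded_heredoc() {
cat << EOFILE
<TITLE>$title</TITLE>
<P>Rows: `expr 2 + 3`</P>
EOFILE
}

# Quoted delimiter: the lines are passed through literally.
literal_heredoc() {
cat << 'EOFILE'
<TITLE>$title</TITLE>
EOFILE
}
```

    Calling expanded_heredoc prints the title and the computed number, while literal_heredoc prints the text $title unchanged. Example 5-2 relies on the unquoted form so that $USER and `hostname` are filled in.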

    There is an important point about the script in Example 5-2: it is independent of the number of columns in the table! This means that it can be used to convert any tab-separated value file to HTML. Spreadsheet programs can usually save data in such a format, so our simple tool can produce correct HTML from spreadsheet data.
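    To see the column independence concretely, the sed step can be tried on rows of different widths. This is a sketch under assumptions: tsv_rows is a hypothetical wrapper around the same edit commands, the sample names and numbers are made up, and a literal tab held in the variable TAB replaces \t so that the pattern does not depend on a sed extension.

```shell
# Sketch only: the row-wrapping sed step from Example 5-2 as a function.
# tsv_rows and TAB are hypothetical names introduced for this example.
TAB=$(printf '\t')    # a literal tab character

tsv_rows() {
    sed -e 's=&=\&amp;=g' \
        -e 's=<=\&lt;=g' \
        -e 's=>=\&gt;=g' \
        -e "s=$TAB=</TD><TD>=g" \
        -e 's=^.*$=<TR><TD>&</TD></TR>='
}

# The same filter handles a two-column and a four-column row.
printf 'Jones\t555-0123\n' | tsv_rows
printf 'Jones\tAdrian\tOSD211\t555-0123\n' | tsv_rows
```

    Because the tab substitution is global and the final command wraps the whole line, nothing in the filter refers to a particular number of fields.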

    We were careful in tsv-to-html to maintain the spacing structure of the original office directory, because that makes it easy to apply further filters downstream. Indeed, html-pretty was written precisely for that reason: standardization of HTML markup layout radically simplifies other HTML tools.

    How would we handle conversion of accented characters to HTML entities? We could augment the sed command with extra edit steps like -e 's=é=\&eacute;=g', but there are about 100 entities to cater for, and we are likely to need similar substitutions as we convert other kinds of text files to HTML.

    It therefore makes sense to delegate that task to a separate program that we can reuse, either as a pipeline stage following the sed command in Example 5-2, or as a filter applied later. (This is the “detour to build specialized tools” principle in action.) Such a program is just a tedious tabulation of substitution commands, and we need one for each of the local text encodings, such as the various ISO 8859-n code pages mentioned in “How Are Files Named?” in Appendix B. We don’t show such a filter completely here, but a fragment of one in Example 5-3 gives the general flavor. For readers who need it, we include the complete program for handling the common case of Western European characters in the ISO 8859-1 encoding with this book’s sample programs. HTML’s entity repertoire isn’t sufficient for other accented characters, but since the World Wide Web is moving in the direction of Unicode and XML in place of ASCII and HTML, this problem is being solved in a different way, by getting rid of character set limitations.

    Example 5-3. Fragment of iso8859-1-to-html program

    #! /bin/sh
    # Convert an input stream containing characters in ISO 8859-1
    # encoding from the range 128..255 to HTML equivalents in ASCII.
    # Characters 0..127 are preserved as normal ASCII.
    #
    # Usage:
    #   iso8859-1-to-html infile(s) >outfile

    sed \
      -e 's= =\&nbsp;=g' \
      -e 's=¡=\&iexcl;=g' \
      -e 's=¢=\&cent;=g' \
      -e 's=£=\&pound;=g' \
    ...
      -e 's=ü=\&uuml;=g' \
      -e 's=ý=\&yacute;=g' \
      -e 's=þ=\&thorn;=g' \
      -e 's=ÿ=\&yuml;=g' \
      "$@"

    Here is a sample of the use of this filter:

      $ cat danish                    # Show sample Danish text in ISO 8859-1 encoding
      Øen med åen lå i læ af én halvø,
      og én stor ø, langs den græske kyst.

      $ iso8859-1-to-html danish      # Convert text to HTML entities
      &Oslash;en med &aring;en l&aring; i l&aelig; af &eacute;n halv&oslash;,
      og &eacute;n stor &oslash;, langs den gr&aelig;ske kyst.
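
    Typing ISO 8859-1 characters into a script can be awkward on a system whose editor and terminal use a different encoding. One possible workaround, sketched here with only two table entries, is to build the patterns from octal byte values and run sed in the C locale so that each byte is treated as one character. The function and variable names are hypothetical, and this is a stand-in for the full program, not a copy of it.

```shell
# Sketch only: a two-entry stand-in for the entity filter, working on raw
# ISO 8859-1 bytes. latin1_to_html, E_ACUTE, and O_SLASH are hypothetical
# names; the real program tabulates the whole 128..255 range.
E_ACUTE=$(printf '\351')    # byte 0xE9, e-acute in ISO 8859-1
O_SLASH=$(printf '\370')    # byte 0xF8, o-slash in ISO 8859-1

latin1_to_html() {
    # LC_ALL=C makes sed match single bytes rather than multibyte characters.
    LC_ALL=C sed \
        -e "s=$E_ACUTE=\\&eacute;=g" \
        -e "s=$O_SLASH=\\&oslash;=g" \
        "$@"
}
```

    Like the fragment in Example 5-3, the function passes "$@" to sed, so it works both on named files and as a pipeline stage reading standard input.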

    Please check back next week for the conclusion to this article.



     
     