5.2 Structured Data for the Web

In this two-part series, you will learn how to handle text processing jobs in Unix with pipelines. This article is excerpted from chapter 5 of Classic Shell Scripting, written by Arnold Robbins and Nelson H.F. Beebe (O'Reilly; ISBN: 0596005954). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

By: O'Reilly Media
June 26, 2008

The immense popularity of the World Wide Web makes it desirable to be able to present data like the office directory developed in the last section in a form that is a bit fancier than our simple text file.

Web files are mostly written in a markup language called HyperText Markup Language (HTML). This is a family of languages that are specific instances of the Standard Generalized Markup Language (SGML), which has been defined in several ISO standards since 1986. The manuscript for this book was written in DocBook/XML, which is also a specific instance of SGML. You can find a full description of HTML in HTML & XHTML: The Definitive Guide (O’Reilly).

A Digression on Databases

Most commercial databases today are constructed as relational databases: data is accessible as key:value pairs, and join operations are used to construct multicolumn tables to provide views of selected subsets of the data. Relational databases were first proposed in 1970 by E. F. Codd, who actively promoted them despite initial opposition from the database industry, which held that they could not be implemented efficiently. Fortunately, clever programmers soon figured out how to solve the efficiency problem. Codd’s work is so important that, in 1981, he was given the prestigious ACM Turing Award, the closest thing in computer science to the Nobel Prize.

Today, there are several ISO standards for the Structured Query Language (SQL), making vendor-independent database access possible, and one of the most important SQL operations is join. Hundreds of books have been published about SQL; to learn more, pick a general one like SQL in a Nutshell. Our simple office-directory task thus carries an important lesson about the central concept of modern relational databases. Unix software tools can be extremely valuable in preparing input for databases and in processing their output.

For the purposes of this section, we need only a tiny subset of HTML, which we present here in a small tutorial. If you are already familiar with HTML, just skim the next page or two.

Here is a minimal standards-conformant HTML file produced by a useful tool written by one of us:

  $ echo Hello, world. | html-pretty 
  <!-- -*-html-*- -->
  <!-- Prettyprinted by html-pretty flex version 1.01 [25-Aug-2001] -->
  <!-- on Wed Jan 8 12:12:42 2003 -->
  <!-- for Adrian W. Jones (jones@example.com) -->

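  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
  <HTML>
    <HEAD>
      <TITLE>
        <!-- Please supply a descriptive title here -->
      </TITLE>
      <!-- Please supply a correct e-mail address here -->
      <LINK REV="made" HREF="mailto:jones@example.com">
    </HEAD>
    <BODY>
      Hello, world.
    </BODY>
  </HTML>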

The points to note in this HTML output are:

  1. HTML comments are enclosed in <!-- and -->.
  2. Special processor commands are enclosed in <! and >: here, the DOCTYPE command tells an SGML parser what the document type is and where to find its grammar file.
  3. Markup is supplied by angle-bracketed words, called tags. In HTML, lettercase is not significant in tag names: html-pretty normally uppercases tag names for better visibility.
  4. Markup environments consist of a begin tag, <NAME>, and an end tag, </NAME>, and for many tags, environments can be nested within each other according to rules defined in the HTML grammars.
  5. An HTML document is structured as an HTML object containing one HEAD and one BODY object.
  6. Inside the HEAD, a TITLE object defines the document title that web browsers display in the window titlebar and in bookmark lists. Also inside the HEAD, the LINK object generally carries information about the web-page maintainer.
  7. The visible part of the document that browsers show is the contents of the BODY.
  8. Whitespace is not significant outside of quoted strings, so we can use horizontal and vertical spacing liberally to emphasize the structure, as the HTML prettyprinter does.
  9. Everything else is just printable ASCII text, with three exceptions. Literal angle brackets must be represented by special encodings, called entities, that consist of an ampersand, an identifier, and a semicolon: &lt; and &gt;. Since ampersand starts entities, it has its own literal entity name: &amp;. HTML supports a modest repertoire of entities for accented characters that cover most of the languages of Western Europe so that we can write, for example, caf&eacute; du bon go&ucirc;t to get café du bon goût.
  10. Although not shown in our minimal example, font style changes are accomplished in HTML with B (bold), EM (emphasis), I (italic), STRONG (extra bold), and TT (typewriter, that is, fixed-width characters) environments: write <B>bold phrase</B> to get the words "bold phrase" displayed in bold.
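
The entity rule in item 9 can be sketched as a small shell filter. This is only an illustration, not the authors' code: the function name escape_html is hypothetical, and the sketch handles just the three characters that must always be encoded. The ampersand is replaced first, so that the ampersands introduced by the other two substitutions are not encoded a second time.

```shell
#!/bin/sh
# escape_html: hypothetical helper name, chosen for this sketch.
# Reads text on standard input and writes it with the three special
# HTML characters replaced by their entities.
escape_html() {
    # Order matters: encode & first, then < and >.
    # In the sed replacement text, \& stands for a literal ampersand.
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g'
}

# Example use: prints  AT&amp;T &lt;ops&gt; R&amp;D
printf '%s\n' 'AT&T <ops> R&D' | escape_html
```

Because the function is a plain filter, it can sit in the middle of a pipeline, ahead of whatever stage adds the markup tags.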

To convert our office directory to proper HTML, we need only one more bit of information: how to format a table, since that is what our directory really is and we don’t want to force the use of typewriter fonts to get everything to line up in the browser display.

In HTML 3.0 and later, a table consists of a TABLE environment, inside of which are rows, each of them a table row (TR) environment. Inside each row are cells, called table data, each a TD environment. Notice that columns of data receive no special markup: a data column is simply the set of cells taken from the same row position in all of the rows of the table. Happily, we don’t need to declare the number of rows and columns in advance. The job of the browser or formatter is to collect all of the cells, determine the widest cell in each column, and then format the table with columns just wide enough to hold those widest cells.

For our office directory example, we need just three columns, so our sample entry could be marked up like this:

<TR>
  <TD>
    Jones, Adrian W.
  </TD>
  <TD>
    555-0123
  </TD>
  <TD>
    OSD211
  </TD>
</TR>

An equivalent, but compact and hard-to-read, encoding might look like this:

<TR><TD>Jones, Adrian W.</TD><TD>555-0123</TD><TD>OSD211</TD></TR>
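
As a rough sketch of how such rows might be generated mechanically, the following filter wraps each input line in a TR environment and each field in a TD environment. It rests on assumptions that are not in the text: the function name dir_to_table is hypothetical, the directory is assumed to arrive as tab-separated fields, and the fields are assumed to contain no literal &, <, or > characters (those would first have to be replaced by entities).

```shell
#!/bin/sh
# dir_to_table: hypothetical helper name, chosen for this sketch.
# Reads tab-separated records on standard input and writes an HTML
# table in the compact one-row-per-line encoding.
dir_to_table() {
    echo '<TABLE>'
    awk -F '\t' '{
        printf "  <TR>"
        # One TD environment per input field; no column count is declared.
        for (i = 1; i <= NF; i++)
            printf "<TD>%s</TD>", $i
        print "</TR>"
    }'
    echo '</TABLE>'
}

# Example use with the sample entry from the text.
printf 'Jones, Adrian W.\t555-0123\tOSD211\n' | dir_to_table
```

The loop over NF means the same filter works for any number of columns, which matches the point that HTML tables need no advance declaration of their dimensions.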
