An Introduction to XML - SGML, HTML and XML
SGML - Standard Generalized Markup Language
SGML is an international standard for describing electronic documents. It is a
metalanguage: a language used to define other markup languages. SGML describes
the logical structure of text documents. It is used primarily for the creation,
storage, and distribution of documents, and as a source format for conversion
to other document formats. SGML documents have been used in the US military and
the American aviation industry for many years. SGML is too complicated for most
web publishers, which is one reason for the growth of HTML, a much simpler
markup language defined using SGML.
HTML - Hyper Text Markup Language
HTML can be considered one of the simplest applications of SGML, simple enough
to make Web publishing accessible to anyone. Publishers do not necessarily need
to know HTML, as many WYSIWYG editors are available on the market.
What are the problems with HTML?
HTML is too restrictive. Its standard tags are predefined by the W3C, so HTML is
not powerful enough to describe more complex documents. HTML is more
presentation oriented than content oriented, so HTML tags give little indication
of the meaning of the content. You may ask why the W3C cannot simply introduce
more tags to describe content. Adding tags has already caused a problem of its
own: browser companies have introduced new, proprietary tags to attract users to
their products.
With current HTML, publishers have to make many adjustments to their documents
to keep them compatible with popular browsers. Browsers do not reject bad HTML
code, and hence the Internet has a lot of documents with HTML mistakes. These
issues were raised by content managers and Internet publishers, and the problem
escalated to such an extent that the W3C began to look for alternatives. What is
the solution?
XML - eXtensible Markup Language
XML can be considered a simplified version of SGML. XML is case sensitive:
<p> is different from <P>, though in HTML both would be considered the same.
XML is extensible - You can create your own elements to meet your publishing
demands. You need not wait for the W3C to release the next version of HTML
with the tags you require.
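As a small illustration of extensibility, the sketch below builds a document out of element names that no standard predefines. It uses Python's standard xml.etree.ElementTree module; the names Recipe, Title and Ingredient and the quantity attribute are invented for this example and are not part of any specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary made up for this sketch: XML itself does not
# predefine Recipe, Title, Ingredient or the quantity attribute.
recipe = ET.Element("Recipe")
ET.SubElement(recipe, "Title").text = "Pancakes"
ingredient = ET.SubElement(recipe, "Ingredient", quantity="2")
ingredient.text = "eggs"

# Serialize the tree to a string of XML markup.
xml_text = ET.tostring(recipe, encoding="unicode")
print(xml_text)
```

Any program that reads XML can parse this document, even though it has never seen these element names before.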
XML is structured - XML documents must adhere to a specific structure. If a
document is not structured properly (that is, if it is not well-formed), it is
not considered to be XML.
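The structure requirement can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is an assumption made for this example. It also shows the case sensitivity mentioned earlier: a <p> start tag is not closed by a </P> end tag.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, False otherwise.

    Hypothetical helper for this sketch; it only reports whether the
    parser accepted the document, not what was wrong with it.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p>Hello</p>"))  # start and end tags match
print(is_well_formed("<p>Hello</P>"))  # end tag differs in case
print(is_well_formed("<p>Hello"))      # end tag is missing
```

Only the first document is accepted; the parser rejects the other two instead of trying to guess what the author meant, which is the opposite of how browsers treat bad HTML.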
XML is a much more accessible language than SGML. Since XML documents are well
structured, programmers can easily write software for rendering the XML documents.
XML has simple rules to differentiate between the document contents and the XML
markup elements.
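One way to see this separation at work: when a character that XML reserves for markup appears in the content itself, it is written as an entity reference (&amp;amp; for &amp;, &amp;lt; for <, &amp;gt; for >, &amp;apos; for ' and &amp;quot; for "). The sketch below uses the escape and unescape functions from Python's standard xml.sax.saxutils module; the helper name escape_content and the sample text are assumptions made for this example.

```python
from xml.sax.saxutils import escape, unescape

# escape() replaces &, < and > on its own; the two quote characters are
# passed in as extra replacements.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}
QUOTE_CHARACTERS = {"&apos;": "'", "&quot;": '"'}


def escape_content(text: str) -> str:
    """Replace the five markup characters with entity references."""
    return escape(text, QUOTE_ENTITIES)


raw = 'AT&T says "1 < 2"'
escaped = escape_content(raw)
print(escaped)

# Reversing the replacement gives back the original content.
restored = unescape(escaped, QUOTE_CHARACTERS)
print(restored)
```

An XML parser performs the reverse replacement automatically when it reads the document, so applications see the original characters.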
XML markup starts with either a less-than symbol (<) or an ampersand (&)
character. XML also uses the greater-than symbol (>), the single quote (') and
the double quotation mark (") in markup. To use any of these characters in
document content, use the corresponding predefined XML entity (&amp; for &,
&gt; for >, &lt; for <, &apos; for ' and &quot; for ").
What is DTD - Document Type Definition
A DTD can be considered the grammar for a markup language. It is a set of rules
that specifies how the XML markup may be used. It defines the elements, each
element's attributes and their allowed values, and which elements can be
contained in others. A DTD can also define entities.
We will consider an example DTD for email:
<!ELEMENT Mail (From, To, Cc?, Date?, Subject, Body)>
<!ELEMENT From (#PCDATA)>
<!ELEMENT To (#PCDATA)>
<!ELEMENT Cc (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Body (#PCDATA | P | Br)*>
<!ELEMENT P (#PCDATA | Br)*>
<!ATTLIST P align (left | right | justify) "left">
<!ELEMENT Br EMPTY>
Description
An XML document conforming to the mail DTD has, in this order, exactly one From,
one To, an optional Cc, an optional Date, one Subject and one Body.
- A From element has only text.
- A To element has only text.
- A Cc element has only text.
- A Date element has only text.
- A Subject element has only text.
- A Body element can have text and zero or more P and Br elements, in any order.
- A P element can have text and zero or more Br elements.
- The P element has an align attribute. The attribute's possible values are left,
right or justify. Its default value is left.
- The Br element is empty.
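To make the description concrete, here is a sketch of a message that follows the mail DTD, together with a hand-written check of the Mail content model (From, To, Cc?, Date?, Subject, Body). The addresses and text are invented placeholders, and the function names are assumptions made for this example. Python's standard xml.etree.ElementTree module does not validate against a DTD, so the check below only imitates what a validating parser would do for the children of Mail; it does not check the contents of Body or the align attribute.

```python
import xml.etree.ElementTree as ET

# Invented sample message; Cc and Date are optional and left out here.
SAMPLE = """<Mail>
  <From>alice@example.com</From>
  <To>bob@example.com</To>
  <Subject>Lunch</Subject>
  <Body>See you at noon.<Br/><P align="right">Alice</P></Body>
</Mail>"""

# (element name, required?) pairs in the order the DTD lists them.
MAIL_MODEL = [("From", True), ("To", True), ("Cc", False),
              ("Date", False), ("Subject", True), ("Body", True)]


def child_names(xml_text):
    """Return the tag names of the root element's children, in order."""
    return [child.tag for child in ET.fromstring(xml_text)]


def matches_mail_model(names):
    """Check a list of child names against the Mail content model."""
    position = 0
    for name, required in MAIL_MODEL:
        if position < len(names) and names[position] == name:
            position += 1          # expected element is present
        elif required:
            return False           # a required element is missing
    return position == len(names)  # no unexpected elements left over


print(child_names(SAMPLE))
print(matches_mail_model(child_names(SAMPLE)))
```

Swapping From and To, or leaving out Subject, makes the check fail, which mirrors the ordering and occurrence rules in the first line of the DTD.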
An XML parser (discussed in the software section) will use the DTD when it
parses the document. DTDs also make it possible to publish your documents in a
form that others can use. The XML document should tell XML processing programs
where to find the DTD.
A <!DOCTYPE> declaration at the start of the XML file tells the program the
location of the DTD. For example:
<!DOCTYPE Mail SYSTEM "http://infowest.com/DTDS/mail.dtd">
<Mail>
..
..
..
</Mail>
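A program can read this declaration back from a parsed document. The sketch below uses Python's standard xml.dom.minidom module; the message content is an invented placeholder. Note that minidom is a non-validating parser: it records the declaration but does not download the DTD or check the document against it.

```python
from xml.dom import minidom

# Invented sample document; only the DOCTYPE line comes from the article.
DOCUMENT = """<!DOCTYPE Mail SYSTEM "http://infowest.com/DTDS/mail.dtd">
<Mail>
  <From>alice@example.com</From>
  <To>bob@example.com</To>
  <Subject>Lunch</Subject>
  <Body>See you at noon.</Body>
</Mail>"""

doc = minidom.parseString(DOCUMENT)
doctype = doc.doctype

print(doctype.name)      # name of the root element the DTD describes
print(doctype.systemId)  # location of the DTD given after SYSTEM
```

The name in the declaration must match the document's root element, which is Mail in this case.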