Working with XML Documents and Python

XML can be used for describing data without needing a database. However, this leaves us with the problem of interpreting the data embedded within the XML. This is where Python comes to the rescue, as Peyton explains.

Introduction

XML, or eXtensible Markup Language, is a useful tool for describing all sorts of data. Take a look at this example:

<?xml version="1.0" encoding="UTF-8"?>
<devshed>
   <categories>
      <python>
         <article>
            <title>MySQL Connectivity With Python</title>
            <author>Icarus</author>
         </article>
         <article>
            <title>Python UnZipped</title>
            <author>Mark Lee Smith</author>
         </article>
      </python>
      <php>
         <article>
            <title>Writing Clean and Efficient PHP Code</title>
            <author>David Fells</author>
         </article>
         <article>
            <title>Using the PHP Crypt Function</title>
            <author>Chris Root</author>
         </article>
      </php>
   </categories>
</devshed>

This document describes four articles in the DevShed backend. The root tag is devshed. Inside it, the categories tag contains two category tags, python and php. Each of these holds two article tags, and each article contains a title tag and an author tag. With this layout, we can describe the data easily and without a database, and someone with little experience can modify it later with little effort.

However, we are faced with the problem of interpreting the data embedded within the XML. We need something to parse it with. Python includes a few tools that can aid us in parsing the data, and we’ll take a look at them in this article, learning about them through example.

Organizing a Book Collection

Let’s say we want to organize a book collection using XML to describe it all. We don’t need anything fancy. We only need to store the title, author, and genre of the book. Let’s go ahead and create the markup for a few books:

<?xml version="1.0" encoding="UTF-8"?>
<collection>
   <book>
      <title>The Once and Future King</title>
      <author>T.H. White</author>
      <genre>Fantasy</genre>
   </book>
   <book>
      <title>The Curse of Chalion</title>
      <author>Lois McMaster Bujold</author>
      <genre>Fantasy</genre>
   </book>
   <book>
      <title>Paladin of Souls</title>
      <author>Lois McMaster Bujold</author>
      <genre>Fantasy</genre>
   </book>
   <book>
      <title>Alas, Babylon</title>
      <author>Pat Frank</author>
      <genre>Fiction</genre>
   </book>
   <book>
      <title>Rifles for Watie</title>
      <author>Harold Keith</author>
      <genre>Fiction</genre>
   </book>
</collection>

Now we’re left with parsing the data and turning it into something presentable. If you examine the way the data is stored, you will notice that it is similar to a dictionary in Python. Therefore, a dictionary would be an ideal type to store the data in. We’ll create a chunk of code that does just this. SAX, Simple API for XML, will be used for this project. It is contained in xml.sax:

import xml.sax

# Create a collection list
collection = []

# This handles the parsing of the content
class HandleCollection(xml.sax.ContentHandler):

    def __init__(self):

        xml.sax.ContentHandler.__init__(self)
        self.book = {}
        self.title = False
        self.author = False
        self.genre = False

    # Called at the start of an element
    def startElement(self, name, attributes):

        if name == 'title':
            self.title = True
        elif name == 'author':
            self.author = True
        elif name == 'genre':
            self.genre = True

    # Called at the end of an element
    def endElement(self, name):

        if name == 'book':
            collection.append(self.book)
            self.book = {}
        elif name == 'title':
            self.title = False
        elif name == 'author':
            self.author = False
        elif name == 'genre':
            self.genre = False

    # Called to handle content besides elements
    # The parser may deliver text in several pieces, so we append to what we have
    def characters(self, content):

        if self.title:
            self.book['title'] = self.book.get('title', '') + content
        elif self.author:
            self.book['author'] = self.book.get('author', '') + content
        elif self.genre:
            self.book['genre'] = self.book.get('genre', '') + content

# Parse the collection
parser = xml.sax.make_parser()
parser.setContentHandler(HandleCollection())
parser.parse('collection.xml')

As you can see, there’s really not much work involved. All we have to do is write the instructions that organize each book into a dictionary and put all the dictionaries into the collection list. We start by subclassing xml.sax.ContentHandler. The class we create is charged with handling the content of the document we parse. In our class’s __init__ method, we define a few variables. The book dictionary will, of course, house the book’s information. The title variable will be used by the characters method to determine whether we are dealing with the title tag’s content. The same goes for the author variable and the genre variable. These are set to True in startElement if we’re dealing with that particular element. They are then set to False when we have finished using them in endElement. Finally, we instruct Python to parse the file in the last three lines.
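To try the idea without a file on disk, the same approach can be sketched with xml.sax.parseString. This is only a minimal sketch in Python 3 syntax: the names BookHandler, SAMPLE, FIELDS, and parse_books are made up for illustration, and the handler tracks a single "current field" name instead of three flags. It appends text in characters rather than assigning it, since a SAX parser is allowed to deliver one element's text in several calls.

```python
import xml.sax

# Hypothetical sample data, shortened from the book collection
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<collection>
   <book>
      <title>The Once and Future King</title>
      <author>T.H. White</author>
      <genre>Fantasy</genre>
   </book>
   <book>
      <title>Alas, Babylon</title>
      <author>Pat Frank</author>
      <genre>Fiction</genre>
   </book>
</collection>"""

FIELDS = ('title', 'author', 'genre')


class BookHandler(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.books = []      # finished book dictionaries
        self.book = {}       # the book currently being read
        self.current = None  # name of the field we are inside, if any

    def startElement(self, name, attributes):
        if name in FIELDS:
            self.current = name
            self.book[name] = ''

    def endElement(self, name):
        if name == 'book':
            self.books.append(self.book)
            self.book = {}
        elif name == self.current:
            self.current = None

    def characters(self, content):
        # Append rather than assign: text may arrive in pieces
        if self.current is not None:
            self.book[self.current] += content


def parse_books(data):
    handler = BookHandler()
    xml.sax.parseString(data, handler)
    return handler.books


books = parse_books(SAMPLE)
print(books[0]['title'])
```

Keeping the list on the handler instead of in a module-level variable is a design choice of this sketch; it lets the same class be reused for several documents.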

We are now free to present this information to the user in whichever way we see fit. For example, if we wanted to just output the book information without dressing it up too much, we could simply append some code to the above script that sorts through the list of dictionaries that the script creates:

for book in collection:
    print()
    print('Title:  ' + book['title'])
    print('Author: ' + book['author'])
    print('Genre:  ' + book['genre'])
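Because the result is a plain list of dictionaries, ordinary Python tools apply to it. As one possible presentation, here is a sketch (Python 3 syntax) that groups titles by genre. The collection list is hard-coded with a few of the books so that the snippet runs on its own, and by_genre is just an illustrative name.

```python
from collections import defaultdict

# Stand-in for the list that the parsing script builds
collection = [
    {'title': 'The Once and Future King', 'author': 'T.H. White', 'genre': 'Fantasy'},
    {'title': 'Paladin of Souls', 'author': 'Lois McMaster Bujold', 'genre': 'Fantasy'},
    {'title': 'Alas, Babylon', 'author': 'Pat Frank', 'genre': 'Fiction'},
]

# Map each genre to the titles filed under it
by_genre = defaultdict(list)
for book in collection:
    by_genre[book['genre']].append(book['title'])

# Print the genres alphabetically, with their titles indented below
for genre in sorted(by_genre):
    print(genre + ':')
    for title in sorted(by_genre[genre]):
        print('   ' + title)
```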

Describing a Music Library

While the above XML structure is fine for many things, moving properties into attributes is often more compact. Let’s consider a music library (consisting of individual songs rather than whole albums). Instead of creating individual tags for the album, the artist and the length of each song, which would get tiresome to type, we can store these properties as attributes and keep the name of the song as the tag’s content. Here’s an XML file that does this:

<?xml version="1.0" encoding="UTF-8"?>
<library>
   <track artist='Boston' album='Greatest Hits' time='5:04'>Peace of Mind</track>
   <track artist='Dire Straits' album='Sultans Of Swing-Best of Dire Straits' time='5:50'>Sultans of Swing</track>
   <track artist='Dire Straits' album='Sultans Of Swing-Best of Dire Straits' time='4:12'>Walk of Life</track>
   <track artist='The Eagles' album='The Very Best of The Eagles' time='3:33'>Take It Easy</track>
   <track artist='Gary Allan' album='Best I Ever Had' time='4:18'>Best I Ever Had</track>
   <track artist='Goo Goo Dolls' album='Dizzy Up the Girl' time='4:50'>Iris</track>
   <track artist='Kansas' album='The Ultimate Kansas' time='3:24'>Dust in the Wind</track>
   <track artist='Toby Keith' album='Greatest Hits 2' time='3:27'>How Do You Like Me Now?!</track>
   <track artist='Toby Keith' album='Greatest Hits 2' time='3:15'>Courtesy of the Red, White and Blue</track>
   <track artist='ZZ Top' album='ZZ Top-Greatest Hits' time='4:03'>Got Me Under Pressure</track>
</library>

Had we given every property of a song its own tag, the file would have been much longer. Putting the properties in attributes keeps each track to a single tag.

Now, however, we are left with the task of parsing the data. The process is similar to what we did above, but there are some differences since we are dealing with attributes this time around. Here’s how it all works:

import xml.sax

# Create a class to handle the contents of the XML file
class HandleLibrary(xml.sax.ContentHandler):

    def __init__(self):

        xml.sax.ContentHandler.__init__(self)
        self.artist = ''
        self.album = ''
        self.time = ''

    # Handle the start of an element
    def startElement(self, name, attributes):

        # Check to see if it is a "track" element
        # If so, store the attributes
        if name == 'track':
            self.artist = attributes.getValue('artist')
            self.album = attributes.getValue('album')
            self.time = attributes.getValue('time')

    # Handle content
    def characters(self, content):

        # If the content isn't just newlines or blank space, then we have the track name
        # Print out all the track info
        if content.strip() != '':
            print()
            print('Track:  ' + content)
            print('Artist: ' + self.artist)
            print('Album:  ' + self.album)
            print('Length: ' + self.time)

# Parse the file
parser = xml.sax.make_parser()
parser.setContentHandler(HandleLibrary())
parser.parse('music.xml')

It’s not a very complex script, and it’s not very lengthy. It strongly resembles the previous script, but note that we choose not to store anything in a dictionary. Rather, we just dump it all out to the user as we receive it. The attributes variable in the startElement method is an object representing all the attributes of that tag. We then access the attributes by name with the getValue method, saving the values in variables that we print out in the characters method. That’s all there is to parsing attributes.
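One caveat with printing from characters: the parser may hand over a track name in more than one call, which would print the same track more than once. A more defensive variant, sketched here in Python 3 syntax with made-up names (TrackHandler, SAMPLE), buffers the text and records the track in endElement. It reads attributes with get, which returns a default value instead of raising KeyError when an attribute is missing.

```python
import xml.sax

# Hypothetical two-track excerpt of the music library
SAMPLE = b"""<library>
   <track artist='Boston' album='Greatest Hits' time='5:04'>Peace of Mind</track>
   <track artist='Kansas' album='The Ultimate Kansas' time='3:24'>Dust in the Wind</track>
</library>"""


class TrackHandler(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.tracks = []   # one dictionary per finished track
        self.info = None   # attributes of the track element we are inside
        self.text = []     # pieces of text seen inside it

    def startElement(self, name, attributes):
        if name == 'track':
            self.info = {
                'artist': attributes.get('artist', ''),
                'album': attributes.get('album', ''),
                'time': attributes.get('time', ''),
            }
            self.text = []

    def characters(self, content):
        # Only collect text while a track element is open
        if self.info is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == 'track':
            # Join the pieces now that the whole name has been seen
            self.info['track'] = ''.join(self.text).strip()
            self.tracks.append(self.info)
            self.info = None


handler = TrackHandler()
xml.sax.parseString(SAMPLE, handler)
for track in handler.tracks:
    print(track['track'] + ' (' + track['artist'] + ', ' + track['time'] + ')')
```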

What if, however, we do not know the names of the attributes? It isn’t too much of a problem, since we can loop through all the attributes and get their values:

import xml.sax

# Create a class to handle the XML
class HandleLibrary(xml.sax.ContentHandler):

    def __init__(self):

        xml.sax.ContentHandler.__init__(self)
        self.attributes = None

    # Handle the beginning part of each tag
    def startElement(self, name, attributes):

        # Check to see if we're dealing with a track tag
        # If we are, store a copy of the attributes
        # (the parser may reuse the object it passes in)
        if name == 'track':
            self.attributes = attributes.copy()

    # Handle content
    def characters(self, content):

        if content.strip() != '':

            # Loop through each attribute and print the name and value
            print()
            print('Track: ' + content)
            for attribute in self.attributes.getNames():
                print(attribute[0].upper() + attribute[1:] + ': ' + self.attributes.getValue(attribute))
            print()

# Parse it all
parser = xml.sax.make_parser()
parser.setContentHandler(HandleLibrary())
parser.parse('music.xml')

In the above script, we use the getNames method to retrieve a list of attribute names. We loop through the list and print each attribute’s name (with the first letter capitalized) and value.
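The attributes object also behaves much like a dictionary, so another way to handle unknown names is to copy everything into a dict in one step. The following Python 3 sketch relies on the items method of the attributes object; AttributeCollector and SAMPLE are illustrative names, not part of the scripts in this article. Making a copy matters because the parser may reuse the attributes object once startElement returns.

```python
import xml.sax

# Hypothetical one-track document
SAMPLE = b"<library><track artist='ZZ Top' time='4:03'>Got Me Under Pressure</track></library>"


class AttributeCollector(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.seen = []  # one dict of attributes per track element

    def startElement(self, name, attributes):
        if name == 'track':
            # items() yields (name, value) pairs; dict() makes our own copy
            self.seen.append(dict(attributes.items()))


collector = AttributeCollector()
xml.sax.parseString(SAMPLE, collector)
for attrs in collector.seen:
    for key in sorted(attrs):
        print(key[0].upper() + key[1:] + ': ' + attrs[key])
```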

The Document Object Model

SAX is not the only way to process XML data in Python. The Document Object Model exists, and it gives us an object-oriented interface to XML data. Python contains a library called minidom that provides a simple interface to DOM without a whole lot of bells and whistles. Let’s recreate our music library parser, making use of DOM this time:

import xml.dom.minidom

# Load the music library
library = xml.dom.minidom.parse('music.xml')

# Get a list of the tracks
tracks = library.documentElement.getElementsByTagName('track')

# Go through each track
for track in tracks:

    # Print each track's information
    print()
    print('Track:  ' + track.childNodes[0].nodeValue)
    print('Artist: ' + track.attributes['artist'].nodeValue)
    print('Album:  ' + track.attributes['album'].nodeValue)
    print('Length: ' + track.attributes['time'].nodeValue)

We end up with a very short script in the above example. We start by pointing Python to the file we wish to parse. Then we get all tags by the name of “track” that belong to the main tag. We loop through the list provided, printing out the information contained within. To access the text inside the element, we access the track’s list of childNodes. The text is stored in a node, and we print out the value of it. The attributes are stored in attributes, and we reference them by name, printing out nodeValue.
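minidom elements also offer getAttribute, which returns the value directly (or an empty string if the attribute is absent), and the module provides parseString for data that is already in memory. A short Python 3 sketch of both, with SAMPLE and the variable names chosen only for illustration:

```python
import xml.dom.minidom

# Hypothetical one-track document
SAMPLE = """<library>
   <track artist='Goo Goo Dolls' album='Dizzy Up the Girl' time='4:50'>Iris</track>
</library>"""

library = xml.dom.minidom.parseString(SAMPLE)

# Collect (name, artist, length) for every track element
rows = []
for track in library.getElementsByTagName('track'):
    name = track.firstChild.nodeValue
    rows.append((name, track.getAttribute('artist'), track.getAttribute('time')))

for name, artist, time in rows:
    print(name + ' by ' + artist + ' (' + time + ')')
```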

Again, though, what if we don’t know everything we are going to parse? Let’s recreate the second example of the previous section using DOM:

import xml.dom.minidom

# Load the XML file
library = xml.dom.minidom.parse('music.xml')

# Get a list of tracks
tracks = library.documentElement.getElementsByTagName('track')

# Loop through the tracks
for track in tracks:

    # Print the track name
    print()
    print('Track: ' + track.childNodes[0].nodeValue)

    # Loop through the attributes
    for attribute in track.attributes.keys():
        print(attribute[0].upper() + attribute[1:] + ': ' + track.attributes[attribute].nodeValue)

We loop through the names of the attributes returned in the keys method, printing the name of the attribute out with the first letter capitalized. We then print the value of the attribute out by referencing the attribute by its key and then accessing nodeValue.
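The attribute map also has an items method that hands back name and value together, which saves the second lookup. A minimal Python 3 sketch, assuming a one-track document held in a string (SAMPLE and pairs are illustrative names):

```python
import xml.dom.minidom

# Hypothetical one-track document
SAMPLE = "<library><track artist='Toby Keith' time='3:27'>How Do You Like Me Now?!</track></library>"

library = xml.dom.minidom.parseString(SAMPLE)
track = library.getElementsByTagName('track')[0]

# items() returns (name, value) tuples; sort them for a stable order
pairs = sorted(track.attributes.items())
for name, value in pairs:
    print(name[0].upper() + name[1:] + ': ' + value)
```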

Let’s say we have tags nested in each other. Consider our very first script that parsed a book collection. Let’s rebuild it with DOM:

import xml.dom.minidom

# Load the book collection
collection = xml.dom.minidom.parse('collection.xml')

# Get a list of books
books = collection.documentElement.getElementsByTagName('book')

# Loop through the books
for book in books:

    # Print out the book's information
    print()
    print('Title:  ' + book.getElementsByTagName('title')[0].childNodes[0].nodeValue)
    print('Author: ' + book.getElementsByTagName('author')[0].childNodes[0].nodeValue)
    print('Genre:  ' + book.getElementsByTagName('genre')[0].childNodes[0].nodeValue)

We load the collection file and then get a list of books. Then, we loop through the list of books and print out what’s wrapped inside of each tag, which is in the form of a child node that we must get the value of. It’s all very simple.
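The chain of getElementsByTagName, index, childNodes, and nodeValue repeats for every field and fails with an IndexError if a tag is missing or empty. One way to tidy this up is a small helper, sketched below in Python 3 syntax. The function child_text and the SAMPLE data are hypothetical names introduced for this example; they are not part of minidom.

```python
import xml.dom.minidom

# Hypothetical one-book document with no genre tag
SAMPLE = """<collection>
   <book>
      <title>The Curse of Chalion</title>
      <author>Lois McMaster Bujold</author>
   </book>
</collection>"""


def child_text(element, tag, default=''):
    """Return the text inside the first <tag> under element, or default."""
    matches = element.getElementsByTagName(tag)
    if not matches:
        return default
    # Join every text node, in case the text is split across several nodes
    return ''.join(node.data for node in matches[0].childNodes
                   if node.nodeType == node.TEXT_NODE)


collection = xml.dom.minidom.parseString(SAMPLE)
book = collection.getElementsByTagName('book')[0]
print('Title:  ' + child_text(book, 'title'))
print('Genre:  ' + child_text(book, 'genre', 'Unknown'))
```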

DOM includes many more features, but all you need to simply read a document is contained within the minidom module.

Conclusion

XML is a useful tool for describing data. It can be used to describe just about anything, from book collections and music libraries to user settings for an application. Python contains a few utilities that can be used to read and process XML data, namely SAX and DOM. SAX allows you to create handler classes that process individual items as an XML document is read, and DOM abstracts the entire document, allowing you to easily navigate through a tree of XML data. Both tools are simple to use and help justify the phrase “batteries included” that is used to describe the Python language.
