Working with XML Documents and Python - The Document Object Model (
Page 4 of 4 )
SAX is not the only way to process XML data in Python. The Document Object Model exists, and it gives us an object-oriented interface to XML data. Python contains a library called minidom that provides a simple interface to DOM without a whole lot of bells and whistles. Let's recreate our music library parser, making use of DOM this time:
import xml.dom.minidom
# Load the music library
library = xml.dom.minidom.parse ( 'music.xml' )
# Get a list of the tracks
tracks = library.documentElement.getElementsByTagName ( 'track' )
# Go through each track
for track in tracks:
# Print each track's information
print
print 'Track: ' + track.childNodes [ 0 ].nodeValue
print 'Artist: ' + track.attributes [ 'artist' ].nodeValue
print 'Album: ' + track.attributes [ 'album' ].nodeValue
print 'Length: ' + track.attributes [ 'time' ].nodeValue
We end up with a very short script in the above example. We start by pointing Python to the file we wish to parse. Then we get all tags by the name of “track” that belong to the main tag. We loop through the list provided, printing out the information contained within. To access the text inside the element, we access the track's list of childNodes. The text is stored in a node, and we print out the value of it. The attributes are stored in attributes, and we reference them by name, printing out nodeValue.
Again, though, what if we don't know everything we are going to parse? Let's recreate the second example of the previous section using DOM:
import xml.dom.minidom
# Load the XML file
library = xml.dom.minidom.parse ( 'music.xml' )
# Get a list of tracks
tracks = library.documentElement.getElementsByTagName ( 'track' )
# Loop through the tracks
for track in tracks:
# Print the track name
print
print 'Track: ' + track.childNodes [ 0 ].nodeValue
# Loop through the attributes
for attribute in track.attributes.keys():
print attribute [ 0 ].upper() + attribute [ 1: ] + ': ' +
track.attributes [ attribute ].nodeValue
We loop through the names of the attributes returned in the keys method, printing the name of the attribute out with the first letter capitalized. We then print the value of the attribute out by referencing the attribute by its key and then accessing nodeValue.
Let's say we have tags nested in each other. Consider our very first script that parsed a book collection. Let's rebuild it with DOM:
import xml.dom.minidom
# Load the book collection
collection = xml.dom.minidom.parse ( 'collection.xml' )
# Get a list of books
books = collection.documentElement.getElementsByTagName
( 'book' )
# Loop through the books
for book in books:
# Print out the book's information
print
print 'Title: ' + book.getElementsByTagName ( 'title' )
[ 0 ].childNodes [ 0 ].nodeValue
print 'Author: ' + book.getElementsByTagName ( 'author' )
[ 0 ].childNodes [ 0 ].nodeValue
print 'Genre: ' + book.getElementsByTagName ( 'genre' )
[ 0 ].childNodes [ 0 ].nodeValue
We load the collection file and then get a list of books. Then, we loop through the list of books and print out what's wrapped inside of each tag, which is in the form of a child node that we must get the value of. It's all very simple.
DOM includes many more features, but all you need to simply read a document is contained within the minidom module.
Conclusion
XML is a useful tool for describing data. It can be used to describe just about anything –- from book collections and music libraries to user settings for an application. Python contains a few utilities that can be used to read and process XML data, namely SAX and DOM. SAX allows you to create handler classes that can process individual items within an XML document, and DOM abstracts the entire document, allowing you to easily navigate through a tree of XML data. Both tools are extremely simple to use and contribute to the phrase “batteries included” that is used to describe the Python language.