Parsing XML with SAX and Python

In this article Nadia explains how to parse an XML document using the SAX API implementation available for Python.

This tutorial will explain how to parse an XML document using the SAX API implementation available for Python. Of course, there is more than one way to parse XML data with Python. In this article we will focus on its built-in SAX module.

You don’t have to be an expert in Python or XML in order to follow this article. On the contrary, this is an introduction rather than an in-depth analysis. Of course, I assume that you have some basic knowledge of programming and, preferably, of Python syntax and structures. It will also help if you are aware of the basic XML principles and terms.

In the next part of this article, I will describe the SAX classes of Python. Afterwards, I will use an example in order to show how the theory can be applied. In the last parts I will provide some homework and some links that will help you to delve deeper into the subjects introduced in this article.

In any case, if you want to test the code in this tutorial, you will need Python 2.1 or later installed. I don’t provide any installation details; if you need them, I would recommend that you check other sources, like the article Getting Started With Python.

Our example is web-based, so it would be nice if Python were integrated in your web server. Of course, you may modify the script to run as a standalone application or in any other way you desire.

By the end of the article you will be in a position to use Python to parse XML documents with the SAX interfaces, but let’s first make sure we cover the theory…

The xml.sax Package

SAX is the Simple API for XML. The package xml.sax and its subpackages provide a Python implementation of the SAX interface.

The structure of a SAX application should include one or more input sources, parser and handler objects. The idea is as follows: a parser reads the bytes or characters from the input source and fires a sequence of events on the handler. In this document and in the Python documentation the term ‘reader’ is preferred over ‘parser’.

The SAX API defines four basic interfaces. Since Python does not support interfaces, these SAX interfaces are implemented in the xml.sax.handler module as the following Python classes:
  1. ContentHandler: this implements the main SAX interface for handling document events. It is also the interface which we will use in the example of the next section
  2. DTDHandler: class for handling DTD events
  3. EntityResolver: class for resolving external entities
  4. ErrorHandler: as the name suggests, this class is used for reporting all errors and warnings.
I would like to mention here the presence of the DefaultHandler class from the xml.sax.saxutils package that inherits from all four interfaces above. An application needs to implement only the interfaces it needs, as will be shown by the following example.

Now that we have looked at the interfaces, it’s time to see the basic functions of the xml.sax package. These are:

make_parser() – This will create and return a SAX XMLReader object. Notice that the xml.sax readers are non-validating.

parse(filename_or_stream, handler) – This will create a parser and parse the given document (which can be passed either as a filename or as a file object). The handler must be a ContentHandler instance, the first of the SAX interfaces we mentioned above.

A reader and a handler can be connected with the appropriate method (for example setContentHandler() for a ContentHandler object). Once this happens, the reader will notify of parsing events through the methods of the handler. In the following example, the methods startElement(), endElement() and characters() of the ContentHandler illustrate this procedure.
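The connection between reader and handler can be sketched in a few lines. This is a minimal illustration rather than part of the article's script: the EventLogger class and the tiny in-memory document are made up for the example, and the code is written for a current Python 3 interpreter.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical handler that records every event it is notified of."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only chunks so the log stays readable.
        if content.strip():
            self.events.append(("chars", content))


handler = EventLogger()
reader = make_parser()             # create the reader (parser)
reader.setContentHandler(handler)  # connect the handler to the reader
# parse() accepts a file name or a file-like object; BytesIO stands in
# for a real file here.
reader.parse(io.BytesIO(b"<team><player>Mick</player></team>"))

for event in handler.events:
    print(event)
```

The reader never returns a document tree. Everything the application learns about the document arrives through the handler's methods, in document order.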

We will not go into error handling details in this document, but xml.sax provides enough exception classes for your programming needs. In the Python reference documentation you may find more details.

Enough with the theory, let’s move on to a hands-on example.

Our SAX Parser

Now, let’s put the theory of the previous section into practice. Imagine that you have the statistics of the players of a basketball team in an XML document. Let’s say the name of the document is ‘playerStats.xml’. We will build a Python script that will take a player’s name as an input and then will search the document for the player’s statistics.

Let’s say that your XML document looks something like this:

<team>
  <player name='Mick Fowler' age='27' height='1.96m'>
    <points>17.1</points>
    <rebounds>6.4</rebounds>
  </player>
  <player name='Ivan Ivanovic' age='29' height='2.04m'>
    <points>15.5</points>
    <rebounds>7.8</rebounds>
  </player>
</team>


As you can see, the data of the player is saved as attributes of the ‘player’ element whilst the match averages of points and rebounds are contents of an element.

Let’s make a web form for the user to select one of the players. After the user clicks the submit button, our script will parse the XML document and return the player’s data together with his statistics for average points and rebounds.

The HTML code may be:

<form method="get" action="[your Python cgi script]">
Please select a player:<br>
<select name="playerName">
<option value="Mick Fowler">Mick Fowler</option>
<option value="Ivan Ivanovic">Ivan Ivanovic</option>
</select><br>
<input type="submit" value="Get Player Stats">
</form>


…which renders as a drop-down list containing the two player names, followed by a ‘Get Player Stats’ button.

Now I will explain the steps necessary for our script to do the job. At the end of this section you will find the complete code.

First of all, in our code, we need to import all Python modules we will use:

from xml.sax import make_parser
from xml.sax.handler import ContentHandler
import cgi


Since we’re dealing with a CGI script, don’t forget to put on top the path to the python executable, which is usually (on Unix systems):

#!/usr/bin/python

Please note that the above line may vary, depending on your system configuration. Contact your system administrator for more details in case of doubt.

Also do not forget to print the content type of the CGI response, followed by a blank line, before any other output, so that browsers know how to treat the page. Our script emits HTML, so the type is text/html:

print "Content-Type: text/html\n"

In your script you can use the print command in order to format the output of the script in any way you want. In the example script at the end of this page you will see that I used some basic formatting. Of course, if you want your page to look really good, go ahead and change what is provided here, but do not forget that for production applications it’s usually better to try and separate your logic from your HTML code.

In production applications you may also want to structure your logic in various packages and classes, instead of doing it all in one script as we do here. Since this is not a software engineering course, I will not go into details but return directly to our simple script.

The ‘heart’ of our code is our handler class, which inherits from the ContentHandler class:

class BasketBallHandler(ContentHandler):

This class has a constructor where we initialise the variables to be used in the rest of our code. Of course, we want to make sure we save the value of our search term in a variable. We also define two flags (isPointsElement and isReboundsElement) and give them the value of 0. As soon as the content of an element of interest is reached (for the ‘rebounds’ and ‘points’ elements) these flags will be set to 1.

def __init__(self, searchTerm):

  self.searchTerm = searchTerm
  self.isPointsElement, self.isReboundsElement = 0, 0


I would like to mention here that all these comments and snippets of code are aligned to the left, but as you probably know already, Python is sensitive to the indentation of blocks of code. Take a look at the full code example at the end if in doubt about the correct syntax, order and indentation.

Now we need to work with the elements of our XML document. This is why we will use the methods startElement() and endElement(). These methods will be invoked each and every time an element in the document is opened or closed, respectively.

The name parameter of the startElement() method contains the name of the element type. If you look again at our XML example, you will see that the ‘player’ element has some attributes assigned. We obtain their values by using the get() method of the Attributes interface, then we save these values for later.

The elements ‘points’ and ‘rebounds’ in our XML document are a little different, in the sense that their value is not stored in attributes. This means that what we need to parse is the element content. This is the job of the characters() method, where our variables playerPoints and playerRebounds will be loaded. This is why, at the moment a ‘points’ or ‘rebounds’ element is found, we set the corresponding flag to 1. Here is what our startElement() method looks like:

def startElement(self, name, attrs):

   if name == 'player':
     self.playerName = attrs.get('name', "")
     self.playerAge = attrs.get('age', "")
     self.playerHeight = attrs.get('height', "")
   elif name == 'points':
     self.isPointsElement = 1
     self.playerPoints = ""
   elif name == 'rebounds':
     self.isReboundsElement = 1
     self.playerRebounds = ""
   return
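To see the Attributes object on its own, here is a small self-contained sketch written for Python 3. The AttributeCollector class and the ‘team’ attribute are hypothetical and only serve the illustration. It shows that get() returns the attribute value when the attribute is present and the supplied default when it is not.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class AttributeCollector(ContentHandler):
    """Hypothetical handler that copies the attributes of 'player' elements."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.players = []

    def startElement(self, name, attrs):
        if name == "player":
            self.players.append({
                "name": attrs.get("name", ""),
                "age": attrs.get("age", ""),
                # 'team' is not present in the document, so get() falls
                # back to the default given as second argument.
                "team": attrs.get("team", "unknown"),
            })


document = b"<team><player name='Mick Fowler' age='27' height='1.96m'/></team>"
collector = AttributeCollector()
reader = make_parser()
reader.setContentHandler(collector)
reader.parse(io.BytesIO(document))
print(collector.players)
```

Supplying a default, as the article's handler does with the empty string, avoids a special case for players whose element lacks one of the attributes.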


In the endElement() method we finally compare our search term with the value of the ‘name’ attribute. If they match, then we print our output. You can format this output any way you like. This is also the proper place to reset our flags, before the parser moves to the next element.

Here is how our endElement() method looks:

def endElement(self, name):
   if name == 'points':
     self.isPointsElement = 0
   if name == 'rebounds':
     self.isReboundsElement = 0
   if name == 'player' and self.searchTerm == self.playerName:
       print '<h2>Statistics for player:', self.playerName, '</h2><br>(age:', self.playerAge, 'height', self.playerHeight, ")<br>"
       print 'Match average:', self.playerPoints, 'points,', self.playerRebounds, 'rebounds'


The characters() method is invoked whenever a chunk of character data is found. This is where we use the flags set in our startElement() method; when they have the value of 1 we load our variables with the data. Please note that the character data of an element is not necessarily delivered in a single call. The parser may split it into more than one chunk, which is why we append to our variables instead of assigning to them.

Here is how our characters() method looks:

def characters (self, ch):
   if self.isPointsElement== 1:
     self.playerPoints += ch
   if self.isReboundsElement == 1:
     self.playerRebounds += ch
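The chunking behaviour is easy to observe. The sketch below (Python 3, with a hypothetical ChunkRecorder class and a made-up one-element document) stores each call to characters() separately. How many calls the parser makes is an implementation detail, so the code prints the count instead of assuming it. Only the concatenation of the chunks is reliable.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class ChunkRecorder(ContentHandler):
    """Hypothetical handler that keeps every chunk passed to characters()."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


recorder = ChunkRecorder()
reader = make_parser()
reader.setContentHandler(recorder)
# The entity reference in the middle of the text is a typical spot where a
# parser may hand over the data in several pieces.
reader.parse(io.BytesIO(b"<note>Tom &amp; Jerry</note>"))

print(len(recorder.chunks), "call(s) to characters()")
print("".join(recorder.chunks))
```

Had the handler assigned each chunk instead of appending it, only the last piece would survive whenever the text arrives in more than one call.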


So, this was it with the basic structure of our application! If you remember, this script will be called from a web form and our search term is in the field ‘playerName’ of this form. The following part is the ‘main’ code that does this job and uses the methods defined earlier.

The following snippet gets the player name from the playerName field of the form:

FormData = cgi.FieldStorage()
searchTerm= FormData["playerName"].value


Now that we have our search term, let’s create our parser and handler objects:

parser = make_parser()   
curHandler = BasketBallHandler(searchTerm)


With the help of the method setContentHandler(), we connect the implementation of the ContentHandler to our reader instance as it is shown here:

parser.setContentHandler(curHandler)

Finally we parse our XML document:

parser.parse(open('playerStats.xml'))

Here I paste the code of the finished script as a reference:

#!/usr/bin/python

# The CGI header must come first, followed by a blank line.
print "Content-Type: text/html\n"
print "<html><body>"

from xml.sax import make_parser
from xml.sax.handler import ContentHandler
import cgi

class BasketBallHandler(ContentHandler):

    def __init__(self, searchTerm):
        self.searchTerm = searchTerm
        # Flags telling characters() which element we are currently inside.
        self.isPointsElement, self.isReboundsElement = 0, 0

    def startElement(self, name, attrs):
        if name == 'player':
            # Player data is stored in attributes.
            self.playerName = attrs.get('name', "")
            self.playerAge = attrs.get('age', "")
            self.playerHeight = attrs.get('height', "")
        elif name == 'points':
            self.isPointsElement = 1
            self.playerPoints = ""
        elif name == 'rebounds':
            self.isReboundsElement = 1
            self.playerRebounds = ""
        return

    def characters(self, ch):
        # Append, because the text may arrive in several chunks.
        if self.isPointsElement == 1:
            self.playerPoints += ch
        if self.isReboundsElement == 1:
            self.playerRebounds += ch

    def endElement(self, name):
        if name == 'points':
            self.isPointsElement = 0
        if name == 'rebounds':
            self.isReboundsElement = 0
        if name == 'player' and self.searchTerm == self.playerName:
            print '<h2>Statistics for player:', self.playerName, '</h2><br>(age:', self.playerAge, 'height', self.playerHeight, ")<br>"
            print 'Match average:', self.playerPoints, 'points,', self.playerRebounds, 'rebounds'

# Main part: read the form field, then parse the document.
FormData = cgi.FieldStorage()
searchTerm = FormData["playerName"].value
parser = make_parser()
curHandler = BasketBallHandler(searchTerm)
parser.setContentHandler(curHandler)
parser.parse(open('playerStats.xml'))
print "</body></html>"
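As mentioned in the introduction, the script can also be adapted to run outside a web server. The following is one possible standalone sketch, not the article's script. It is written for Python 3, embeds the sample document as a byte string instead of reading playerStats.xml, and uses hypothetical names (PlayerFinder, find_player). The handler returns the matching player's data as a dictionary instead of printing HTML.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

SAMPLE = b"""<team>
  <player name='Mick Fowler' age='27' height='1.96m'>
    <points>17.1</points>
    <rebounds>6.4</rebounds>
  </player>
  <player name='Ivan Ivanovic' age='29' height='2.04m'>
    <points>15.5</points>
    <rebounds>7.8</rebounds>
  </player>
</team>"""


class PlayerFinder(ContentHandler):
    """Hypothetical handler: stores the matching player instead of printing."""

    def __init__(self, search_term):
        ContentHandler.__init__(self)
        self.search_term = search_term
        self.result = None
        self._current = None  # data of the player currently being read
        self._field = None    # 'points' or 'rebounds' while inside one

    def startElement(self, name, attrs):
        if name == "player":
            self._current = {"name": attrs.get("name", ""),
                             "age": attrs.get("age", ""),
                             "height": attrs.get("height", "")}
        elif name in ("points", "rebounds") and self._current is not None:
            self._field = name
            self._current[name] = ""

    def characters(self, content):
        # Append, because the text may arrive in several chunks.
        if self._field is not None:
            self._current[self._field] += content

    def endElement(self, name):
        if name in ("points", "rebounds"):
            self._field = None
        elif name == "player":
            if self._current["name"] == self.search_term:
                self.result = self._current
            self._current = None


def find_player(xml_bytes, player_name):
    """Return a dict for the named player, or None if there is no match."""
    handler = PlayerFinder(player_name)
    reader = make_parser()
    reader.setContentHandler(handler)
    reader.parse(io.BytesIO(xml_bytes))
    return handler.result


print(find_player(SAMPLE, "Ivan Ivanovic"))
```

Returning data instead of printing it keeps the parsing logic separate from the presentation, which makes the handler usable from a CGI script, a command-line tool or a test.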

Homework

Even if your knowledge of XML and Python is not advanced, I hope that by now you have an idea of how to implement a simple XML parser in Python. Of course this was just a very simple example to illustrate the possibilities of the xml.sax module of Python.

Here follow some ideas, based on our example, for further exploration:
  • Try to load the <select> list on our web form dynamically from the XML document; that is to parse the document and to present the player names found as elements of the list.
  • In our example we mentioned nothing about proper XML syntax and similar matters. In production applications things do not always turn out as we would hope. This is why exception handling should always be part of your application. You can try to use one or more exception classes to control the behaviour of your script when an error occurs, for example by showing an appropriate error message.
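
As a possible starting point for both ideas, here is a hedged sketch written for Python 3. The NameCollector class, the player_names() function and the two test documents are hypothetical. It collects the player names that could fill the <select> list and catches SAXParseException, the exception xml.sax raises for a document that is not well-formed.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NameCollector(ContentHandler):
    """Hypothetical handler that gathers the 'name' of every player."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        if name == "player":
            self.names.append(attrs.get("name", ""))


def player_names(xml_bytes):
    """Return (names, error_message); error_message is None on success."""
    collector = NameCollector()
    try:
        xml.sax.parseString(xml_bytes, collector)
    except xml.sax.SAXParseException as err:
        # Discard partial results and report where the parser gave up.
        return [], "line %d: %s" % (err.getLineNumber(), err.getMessage())
    return collector.names, None


good = b"<team><player name='Mick Fowler'/><player name='Ivan Ivanovic'/></team>"
bad = b"<team><player name='Mick Fowler'></team>"  # <player> is never closed

print(player_names(good))
print(player_names(bad))
```

Returning the error message rather than printing it leaves the caller free to decide how to present the problem to the user.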

Conclusion

This tutorial covered the basics of the xml.sax package and its subpackages. This is not the only way to parse XML using Python. Hopefully in future articles we can take a look at other ways to do so and also take into account matters like XML validation and namespaces.

For the moment I would like to mention here PyXML, a toolkit with more than 200 XML-related modules and advanced implementations, including the xml.sax packages and sub packages. You can download it from here.

A very good reference site for Python and XML is “Uche Ogbuji’s Akara site for XML processing in Python”.

Also, you may want to check out the Python library reference (search for the xml.sax package and subpackages).