Parsing XML with SAX and Python
By: Nadia Poulou
    2003-01-21

    Table of Contents:
  • Parsing XML with SAX and Python
  • The xml.sax Package
  • Our SAX Parser
  • Homework
  • Conclusion



    Parsing XML with SAX and Python - Our SAX Parser


    (Page 3 of 5)

    Now, let’s put the theory of the previous section into practice. Imagine that you have the statistics of the players of a basketball team in an XML document named ‘playerStats.xml’. We will build a Python script that takes a player’s name as input and searches the document for that player’s statistics.

    Let’s say that your XML document looks something like this:

    <team>
      <player name='Mick Fowler' age='27' height='1.96m'>
        <points>17.1</points>
        <rebounds>6.4</rebounds>
      </player>
      <player name='Ivan Ivanovic' age='29' height='2.04m'>
        <points>15.5</points>
        <rebounds>7.8</rebounds>
      </player>
    </team>


    As you can see, each player’s personal data is stored as attributes of the ‘player’ element, whilst the match averages for points and rebounds are stored as the content of child elements.

    Let’s make a web form for the user to select one of the players. After the user clicks the submit button, our script will parse the XML document and return the player’s data together with his average points and rebounds.

    The HTML code might look like this:

    <form method="get" action="[your Python cgi script] ">
    Please select a player:<br>
    <select name="playerName">
    <option value="Mick Fowler">Mick Fowler</option>
    <option value="Ivan Ivanovic">Ivan Ivanovic</option>
    </select><br>
    <input type="submit" value="Get Player Stats">
    </form>


    …which renders as a drop-down list containing the two player names, followed by a ‘Get Player Stats’ button.

    Now I will explain the steps our script needs to do the job. At the end of this section you will find the complete code.

    First of all, we need to import the Python modules we will use:

    from xml.sax import make_parser
    from xml.sax.handler import ContentHandler
    import cgi


    Since we’re dealing with a CGI script, don’t forget to put the path to the Python executable on the very first line. On Unix systems this is usually:

    #!/usr/bin/python

    Please note that the above line may vary, depending on your system configuration. Contact your system administrator for more details in case of doubt.

    Also remember to send the Content-Type header, followed by a blank line, before any other output, so that browsers know how to interpret the page. Since our script prints HTML tags, the content type should be text/html:

    print("Content-Type: text/html\n")

    In your script you can use print to format the output in any way you want. In the example script at the end of this page you will see that I used some basic formatting. Of course, if you want your page to look really good, go ahead and change what is provided here, but keep in mind that for production applications it’s usually better to separate your logic from your HTML code.

    In production applications you may also want to structure your logic in several packages and classes instead of doing it all in one script, as we do here. Since this is not a software engineering course, I will not go into details but return directly to our simple script.

    The ‘heart’ of our code is our handler class, which extends the ContentHandler class:

    class BasketBallHandler(ContentHandler):

    This class has a constructor where we initialise the variables used in the rest of our code. First, we save our search term in an instance variable. We also define two flags (isPointsElement and isReboundsElement) and set them to 0. As soon as the parser enters an element of interest (‘points’ or ‘rebounds’), the corresponding flag will be set to 1.

    def __init__(self, searchTerm):
      # Let the base class set itself up first.
      ContentHandler.__init__(self)
      self.searchTerm = searchTerm
      self.isPointsElement, self.isReboundsElement = 0, 0


    Please note that the snippets of code in this walkthrough are aligned to the left, but as you probably know already, Python is sensitive to the indentation of blocks of code. Take a look at the full code example at the end if in doubt about the correct syntax, order and indentation.

    Now we need to work with the elements of our XML document. For this we will use the methods startElement() and endElement(). These methods are invoked every time an element in the document is opened or closed, respectively.

    The name parameter of the startElement() method contains the name of the element type. If you look again at our XML example, you will see that the ‘player’ element has some attributes. We obtain their values with the get() method of the Attributes interface (its second argument is the default returned when the attribute is missing), then we save these values for later.

    The elements ‘points’ and ‘rebounds’ in our XML document are a little different, in the sense that their values are not stored in attributes. This means that what we need to read is the element content. This is the job of the characters() method, where our variables playerPoints and playerRebounds will be filled. This is why, at the moment a ‘points’ or ‘rebounds’ element is opened, we set the matching flag to 1 and reset the corresponding variable to an empty string. Here is what our startElement() method looks like:

    def startElement(self, name, attrs):
       if name == 'player':
         # Personal data is stored in the attributes of the element.
         self.playerName = attrs.get('name', "")
         self.playerAge = attrs.get('age', "")
         self.playerHeight = attrs.get('height', "")
       elif name == 'points':
         # Raise the flag and start with an empty value.
         self.isPointsElement = 1
         self.playerPoints = ""
       elif name == 'rebounds':
         self.isReboundsElement = 1
         self.playerRebounds = ""
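
    To see the order in which these callbacks fire, here is a minimal sketch, assuming Python 3 and the standard xml.sax.parseString() function. The EventLogger class and the shortened sample document are made up for illustration and are not part of the article’s script.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical helper: records every start/end callback in order."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        # Copy the attributes into a plain dict so they can be inspected later.
        self.events.append(("start", name, dict(attrs.items())))
        if name == "player":
            # get() falls back to the given default when the attribute is absent.
            self.events.append(("height", attrs.get("height", "unknown")))

    def endElement(self, name):
        self.events.append(("end", name))


logger = EventLogger()
xml.sax.parseString(
    b"<team><player name='Mick Fowler' age='27'>"
    b"<points>17.1</points></player></team>",
    logger,
)
for event in logger.events:
    print(event)
```

    The sample ‘player’ element deliberately has no ‘height’ attribute, so the log shows the default value being returned by get().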


    In the endElement() method we finally compare our search term with the value of the ‘name’ attribute. If they match, we print our output. You can format this output any way you like. This is also the proper place to reset our flags, before the parser moves on to the next element.

    Here is what our endElement() method looks like:

    def endElement(self, name):
       # Lower the flag of the element that has just been closed.
       if name == 'points':
         self.isPointsElement = 0
       if name == 'rebounds':
         self.isReboundsElement = 0
       if name == 'player' and self.searchTerm == self.playerName:
         print('<h2>Statistics for player: %s</h2><br>' % self.playerName)
         print('(age: %s, height: %s)<br>' % (self.playerAge, self.playerHeight))
         print('Match average: %s points, %s rebounds'
               % (self.playerPoints, self.playerRebounds))
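
    The values printed here go into the page exactly as they appear in the XML file. As a minimal sketch of one way to keep the formatting apart from the parsing logic and to escape the values on the way out, assuming Python 3’s html.escape() (the function name format_player_html and its paragraph layout are hypothetical):

```python
from html import escape


def format_player_html(name, age, height, points, rebounds):
    """Hypothetical helper: build the HTML fragment for one player."""
    template = (
        "<h2>Statistics for player: %s</h2>"
        "<p>(age: %s, height: %s)</p>"
        "<p>Match average: %s points, %s rebounds</p>"
    )
    # Escape every value so that characters such as < or & cannot break the page.
    values = (name, age, height, points, rebounds)
    return template % tuple(escape(str(value)) for value in values)


print(format_player_html("Mick Fowler", "27", "1.96m", "17.1", "6.4"))
```

    With a function like this, endElement() would only need to collect the data and pass it on.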


    The characters() method is invoked whenever a chunk of character data is found. This is where we use the flags set in our startElement() method: when a flag has the value 1, we append the data to the matching variable. Please note that the character data of an element is not necessarily delivered in a single call; the parser may split it into several chunks, which is why we append (+=) rather than assign.

    Here is what our characters() method looks like:

    def characters(self, ch):
       # Append, because the content may arrive in several chunks.
       if self.isPointsElement == 1:
         self.playerPoints += ch
       if self.isReboundsElement == 1:
         self.playerRebounds += ch
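
    The chunking can be provoked on purpose. The following is a minimal sketch, assuming Python 3 and the default parser returned by make_parser(), which supports incremental feeding through feed() and close(). The PointsCollector class and collect_points() function are hypothetical names. The document is handed over in two pieces that cut the number in half; how many characters() calls result is up to the parser, but joining the chunks always restores the full value.

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class PointsCollector(ContentHandler):
    """Hypothetical helper: keeps every chunk seen inside <points>."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.in_points = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "points":
            self.in_points = True

    def endElement(self, name):
        if name == "points":
            self.in_points = False

    def characters(self, content):
        if self.in_points:
            self.chunks.append(content)


def collect_points(pieces):
    """Feed the document piece by piece and return the collected chunks."""
    handler = PointsCollector()
    parser = make_parser()
    parser.setContentHandler(handler)
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return handler.chunks


chunks = collect_points(["<player><points>17", ".1</points></player>"])
points = "".join(chunks)
print(chunks, points)
```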


    So, that was the basic structure of our application! If you remember, this script will be called from a web form, and our search term is in the ‘playerName’ field of this form. The following part is the ‘main’ code that does this job and uses the methods defined earlier.

    The following snippet gets the player name from the playerName field of the form:

    FormData = cgi.FieldStorage()
    searchTerm = FormData["playerName"].value
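
    On recent Python versions the cgi module is no longer part of the standard library (it was deprecated in Python 3.11 and removed in 3.13). Here is a minimal sketch of an alternative, assuming the form is submitted with method="get" so that the web server passes the fields in the QUERY_STRING environment variable; get_player_name is a hypothetical helper:

```python
import os
from urllib.parse import parse_qs


def get_player_name(query_string):
    """Hypothetical helper: return the playerName field, or "" if absent."""
    fields = parse_qs(query_string)
    # parse_qs maps each field name to a list of values; take the first one.
    return fields.get("playerName", [""])[0]


# In a CGI script the query string comes from the environment.
searchTerm = get_player_name(os.environ.get("QUERY_STRING", ""))

# Standalone check with a query string as a browser would send it.
print(get_player_name("playerName=Mick+Fowler"))
```

    Unlike the dictionary-style lookup above, this version returns an empty string instead of raising an error when the field is missing.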


    Now that we have our search term, let’s create our parser and handler objects:

    parser = make_parser()
    curHandler = BasketBallHandler(searchTerm)


    With the setContentHandler() method, we register our ContentHandler implementation with the parser, as shown here:

    parser.setContentHandler(curHandler)

    Finally we parse our XML document:

    parser.parse(open('playerStats.xml'))

    Here is the complete script for reference:

    #!/usr/bin/python

    from xml.sax import make_parser
    from xml.sax.handler import ContentHandler
    import cgi

    # The header and a blank line must be sent before any other output.
    print("Content-Type: text/html\n")
    print("<html><body>")

    class BasketBallHandler(ContentHandler):

      def __init__(self, searchTerm):
        ContentHandler.__init__(self)
        self.searchTerm = searchTerm
        self.isPointsElement, self.isReboundsElement = 0, 0

      def startElement(self, name, attrs):
        if name == 'player':
          # Personal data is stored in the attributes of the element.
          self.playerName = attrs.get('name', "")
          self.playerAge = attrs.get('age', "")
          self.playerHeight = attrs.get('height', "")
        elif name == 'points':
          self.isPointsElement = 1
          self.playerPoints = ""
        elif name == 'rebounds':
          self.isReboundsElement = 1
          self.playerRebounds = ""

      def characters(self, ch):
        # Append, because the content may arrive in several chunks.
        if self.isPointsElement == 1:
          self.playerPoints += ch
        if self.isReboundsElement == 1:
          self.playerRebounds += ch

      def endElement(self, name):
        # Lower the flag of the element that has just been closed.
        if name == 'points':
          self.isPointsElement = 0
        if name == 'rebounds':
          self.isReboundsElement = 0
        if name == 'player' and self.searchTerm == self.playerName:
          print('<h2>Statistics for player: %s</h2><br>' % self.playerName)
          print('(age: %s, height: %s)<br>' % (self.playerAge, self.playerHeight))
          print('Match average: %s points, %s rebounds'
                % (self.playerPoints, self.playerRebounds))

    FormData = cgi.FieldStorage()
    searchTerm = FormData["playerName"].value
    parser = make_parser()
    curHandler = BasketBallHandler(searchTerm)
    parser.setContentHandler(curHandler)
    parser.parse(open('playerStats.xml'))
    print("</body></html>")
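
    If you would like to try the parsing logic without a web server, here is a minimal sketch, assuming Python 3. It reads the sample document from an in-memory string with xml.sax.parseString() and returns the data instead of printing it. The names PlayerStatsHandler and find_player are hypothetical, and the handler is a variation on the class above, not a replacement for it.

```python
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<team>
  <player name='Mick Fowler' age='27' height='1.96m'>
    <points>17.1</points>
    <rebounds>6.4</rebounds>
  </player>
  <player name='Ivan Ivanovic' age='29' height='2.04m'>
    <points>15.5</points>
    <rebounds>7.8</rebounds>
  </player>
</team>"""


class PlayerStatsHandler(ContentHandler):
    """Hypothetical variant of the handler that stores the match in a dict."""

    def __init__(self, search_term):
        ContentHandler.__init__(self)
        self.search_term = search_term
        self.current = None  # data of the player being read
        self.field = None    # 'points' or 'rebounds' while inside that element
        self.result = None   # data of the matching player, if any

    def startElement(self, name, attrs):
        if name == "player":
            self.current = {
                "name": attrs.get("name", ""),
                "age": attrs.get("age", ""),
                "height": attrs.get("height", ""),
                "points": "",
                "rebounds": "",
            }
        elif name in ("points", "rebounds") and self.current is not None:
            self.field = name

    def characters(self, content):
        # Append, because the content may arrive in several chunks.
        if self.field is not None:
            self.current[self.field] += content

    def endElement(self, name):
        if name in ("points", "rebounds"):
            self.field = None
        elif name == "player":
            if self.current["name"] == self.search_term:
                self.result = self.current
            self.current = None


def find_player(xml_bytes, search_term):
    """Return a dict with the player's data, or None if nobody matches."""
    handler = PlayerStatsHandler(search_term)
    xml.sax.parseString(xml_bytes, handler)
    return handler.result


print(find_player(SAMPLE, "Ivan Ivanovic"))
```

    Returning a dictionary keeps the parsing separate from the HTML output, in line with the earlier advice about separating logic from presentation.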

    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 2 hosted by Hostway