Beginning Python (2005)
Using Python for XML
#Get all the book elements in the library
books = myLibrary.getElementsByTagName("book")

#Print each book's title and author(s)
printLibrary(myLibrary)

#Insert a new book in the library
newBook = myDoc.createElement("book")
newBookTitle = myDoc.createElement("title")
titleText = myDoc.createTextNode("Beginning Python")
newBookTitle.appendChild(titleText)
newBook.appendChild(newBookTitle)
newBookAuthor = myDoc.createElement("author")
authorName = myDoc.createTextNode("Peter Norton, et al")
newBookAuthor.appendChild(authorName)
newBook.appendChild(newBookAuthor)
myLibrary.appendChild(newBook)
print "Added a new book!"
printLibrary(myLibrary)

#Remove a book from the library
#Find ellison book
for book in myLibrary.getElementsByTagName("book"):
    for author in book.getElementsByTagName("author"):
        if author.childNodes[0].data.find("Ellison") != -1:
            removedBook = myLibrary.removeChild(book)
            removedBook.unlink()
print "Removed a book."
printLibrary(myLibrary)

#Write back to the library file
lib = open("library.xml", 'w')
lib.write(myDoc.toprettyxml(" "))
lib.close()
3. Run the file with python xml_minidom.py.
How It Works
To create a DOM, the document needs to be parsed into a document tree. This is accomplished by calling the parse method from xml.dom.minidom. This method returns a Document object, which contains methods for querying for child nodes, getting all nodes in the document of a certain name, and creating new nodes, among other things. The getElementsByTagName method returns a list of Node objects whose names match the argument, which is used to extract the root node of the document, the <library> node. The printLibrary function uses getElementsByTagName again, and then for each book node, prints the title and author. An element that contains only text is considered to have a single child node, a text node, and the text is stored in the data attribute of that node, so book.getElementsByTagName("title")[0].childNodes[0].data simply retrieves the text node below the <title> element and returns its data as a string.
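The calls described above can be sketched as follows. This is a minimal illustration written for Python 3 (the chapter's listings use Python 2 print statements), and it parses a string instead of library.xml. The sample XML, the SAMPLE name, and the list_books helper are assumptions made for this sketch, not the chapter's actual file or code.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data; the chapter's real library.xml is not reproduced here.
SAMPLE = (
    '<library owner="John Q. Reader">'
    '<book><title>Example Title</title>'
    '<author>A. Writer</author><author>B. Ellison</author></book>'
    '<book><title>Beginning Python</title>'
    '<author>Peter Norton, et al</author></book>'
    '</library>'
)

def list_books(xml_text):
    """Return (title, [authors]) pairs for every <book> element."""
    doc = parseString(xml_text)
    # getElementsByTagName returns a list; the root <library> is its only entry.
    library = doc.getElementsByTagName("library")[0]
    books = []
    for book in library.getElementsByTagName("book"):
        # The text of <title> lives in the data attribute of its child text node.
        title = book.getElementsByTagName("title")[0].childNodes[0].data
        authors = [a.childNodes[0].data
                   for a in book.getElementsByTagName("author")]
        books.append((title, authors))
    return books

for title, authors in list_books(SAMPLE):
    print(title, "by", ", ".join(authors))
```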
Chapter 15
Constructing a new node in DOM requires creating a new node as a piece of the Document object, adding all necessary attributes and child nodes, and then attaching it to the correct node in the document tree. The createElement(tagName) method of the Document object creates a new node with a tag name set to whatever argument has been passed in. Adding text nodes is accomplished almost the same way, with a call to createTextNode(string). When all the nodes have been created, the structure is created by calling the appendChild method of the node to which the newly created node will be attached. Node also has a method called insertBefore(newChild, refChild) for inserting nodes in an arbitrary location in the list of child nodes, and replaceChild(newChild, oldChild) to replace one node with another.
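As a small sketch of the three attachment methods just described, the following builds detached <book> nodes and places them with insertBefore, appendChild, and replaceChild. It is written for Python 3, and the make_book helper and the sample titles are hypothetical names chosen for illustration.

```python
from xml.dom.minidom import parseString

doc = parseString("<library><book><title>Second</title></book></library>")
library = doc.documentElement
existing = library.getElementsByTagName("book")[0]

def make_book(document, title):
    """Create a detached <book><title>...</title></book> element."""
    book = document.createElement("book")
    title_node = document.createElement("title")
    title_node.appendChild(document.createTextNode(title))
    book.appendChild(title_node)
    return book

# insertBefore places the new node ahead of an existing child.
library.insertBefore(make_book(doc, "First"), existing)
# appendChild adds the new node to the end of the child list.
library.appendChild(make_book(doc, "Third"))
# replaceChild swaps an existing child for a new node.
library.replaceChild(make_book(doc, "Second, revised"), existing)

titles = [t.childNodes[0].data for t in library.getElementsByTagName("title")]
print(titles)
```

Note that each node is created through the Document object first and only becomes part of the tree once one of the three methods attaches it.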
Removing nodes requires first getting a reference to the node being removed and then a call to removeChild(childNode). After the child has been removed, it’s advisable to call unlink() on it to force garbage collection for that node and any children that may still be attached. This method is specific to the minidom implementation and is not available in xml.dom.
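The removal steps can be sketched like this, again in Python 3 with made-up sample data. The remove_books_by helper is a hypothetical name; it checks all of a book's authors before removing the book once, which avoids trying to remove the same node twice when several authors match.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<library>"
    "<book><author>A. Writer</author></book>"
    "<book><author>B. Ellison</author></book>"
    "</library>"
)
library = doc.documentElement

def remove_books_by(library_node, surname):
    """Remove every <book> with an author containing surname; return the count."""
    removed = 0
    # Work on a copy of the node list as a precaution while removing nodes.
    for book in list(library_node.getElementsByTagName("book")):
        authors = book.getElementsByTagName("author")
        if any(surname in a.childNodes[0].data for a in authors):
            # removeChild hands back the detached node, which is then unlinked.
            library_node.removeChild(book).unlink()
            removed += 1
    return removed

count = remove_books_by(library, "Ellison")
remaining = len(library.getElementsByTagName("book"))
print(count, remaining)
```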
Finally, having made all these changes to the document, it would be useful to be able to write the DOM back to the file from which it came. A utility method called toprettyxml is included with xml.dom.minidom; it takes two optional arguments: an indentation string and a newline string. If not specified, these default to a tab character and \n, respectively. This utility returns the DOM as a nicely indented XML string and is just the thing for writing back to the file.
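A brief sketch of the two arguments, written for Python 3 and using a one-book document invented for the purpose. The compact toxml call is included only for contrast.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<library><book><title>Beginning Python</title></book></library>"
)

# First argument: the indentation string; second: the newline string.
pretty = doc.toprettyxml("  ", "\n")
print(pretty)

# toxml() is the compact counterpart and adds no whitespace.
compact = doc.toxml()
print(compact)
```

One thing to keep in mind with this approach: the added indentation becomes part of the file, so a document that is parsed, pretty-printed, and saved repeatedly can accumulate whitespace text nodes.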
Try It Out
Working with XML Using SAX
This example will show you how you can explore a document with SAX.
#!/usr/bin/python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

#begin bookHandler
class bookHandler(ContentHandler):
    inAuthor = False
    inTitle = False

    def startElement(self, name, attributes):
        if name == "book":
            print "*****Book*****"
        if name == "title":
            self.inTitle = True
            print "Title: ",
        if name == "author":
            self.inAuthor = True
            print "Author: ",

    def endElement(self, name):
        if name == "title":
            self.inTitle = False
        if name == "author":
            self.inAuthor = False

    def characters(self, content):
        if self.inTitle or self.inAuthor:
            print content
#end bookHandler

parser = make_parser()
parser.setContentHandler(bookHandler())
parser.parse("library.xml")
How It Works
The xml.sax parser uses Handler objects to deal with events that occur during the parsing of a document. A handler may be a ContentHandler, a DTDHandler, an EntityResolver for handling entity references, or an ErrorHandler. A SAX application must implement handler classes that conform to these interfaces and then register those handlers with the parser.
The ContentHandler interface contains methods that are triggered by document events, such as the start and end of elements and character data. When parsing character data, the parser has the option of returning it in one large block or in several smaller chunks, so the characters method may be called repeatedly for a single run of text.
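Because characters may fire more than once for the same run of text, a handler that needs the complete text usually buffers the chunks and joins them when the element ends. The following is a minimal sketch of that pattern for Python 3; the TitleCollector class and the sample document are hypothetical.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class TitleCollector(ContentHandler):
    """Collect the full text of every <title>, however it is chunked."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._chunks = None  # None means "not inside a <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._chunks = []

    def characters(self, content):
        # May be called several times for one run of text.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            # Join the buffered chunks into the complete title text.
            self.titles.append("".join(self._chunks))
            self._chunks = None

handler = TitleCollector()
parseString(
    b"<library><book><title>Cats &amp; Dogs</title></book></library>",
    handler,
)
print(handler.titles)
```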
The make_parser function creates a new parser object and returns it. The parser object created will be of the first parser type the system finds. The make_parser function can also take an optional argument consisting of a list of parser module names, each of which must provide a create_parser function. If a list is supplied, those parsers will be tried before the default list of parsers.
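As a sketch of the optional parser list, the following asks for the standard library's expat-based reader by module name and then counts elements in a tiny in-memory document. It is written for Python 3; the ElementCounter class and the sample XML are assumptions made for illustration.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

class ElementCounter(ContentHandler):
    """Count every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1

# Modules named here are tried before make_parser's default list.
parser = make_parser(["xml.sax.expatreader"])
counter = ElementCounter()
parser.setContentHandler(counter)

# parse accepts a file name or a file-like object.
parser.parse(io.BytesIO(b"<library><book/><book/></library>"))
print(counter.count)
```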
Intro to XSLT
XSLT stands for Extensible Stylesheet Language Transformations. Used for transforming XML into output formats such as HTML, it is a functional, template-driven language.
XSLT Is XML
Like a Schema, XSLT is defined in terms of XML, and it’s being used to supplement the capabilities of XML. The XSLT namespace is "http://www.w3.org/1999/XSL/Transform", which identifies the elements that make up the language. XSLT can be validated, like all other XML.
Transformation and Formatting Language
XSLT is used to transform one XML syntax into another or into any other text-based format. It is often used to transform XML into HTML in preparation for web presentation or a custom document model into XSL-FO for conversion into PDF.
Functional, Template-Driven
XSLT is a functional language, much like LISP. The XSLT programmer declares a series of templates, which are functions triggered when a node in the document matches an XPath expression. The programmer cannot guarantee the order of execution, so each function must stand on its own and make no assumptions about the results of other functions.
Using Python to Transform XML Using XSLT
Python doesn’t directly supply a way to create an XSLT, unfortunately. To transform XML documents, an XSLT must be created, and then it can be applied via Python to the XML.
In addition, Python’s core libraries don’t supply a method for transforming XML via XSLT, but a couple of different options are available from other libraries. Fourthought, Inc., offers an XSLT engine as part of its freely available 4Suite package. There are also Python bindings for the widely popular libxslt C library.
The following example uses the latest version of the 4Suite library, which, as of this writing, is 1.0a4. If you don’t have the 4Suite library installed, please download it from http://4suite.org/index.xhtml. You will need it to complete the following exercises.
Try It Out
Transforming XML with XSLT
1. If you haven’t already, save the example XML file from the beginning of this chapter to a file called library.xml.
2. Cut and paste the following XSL from the wrox.com web site into a file called HTMLLibrary.xsl:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/library">
    <html>
    <head>
      <xsl:value-of select="@owner"/>'s Library
    </head>
    <body>
      <h1><xsl:value-of select="@owner"/>'s Library</h1>
      <xsl:apply-templates/>
    </body>
    </html>
  </xsl:template>
  <xsl:template match="book">
    <xsl:apply-templates/>
    <br/>
  </xsl:template>
  <xsl:template match="title">
    <b><xsl:value-of select="."/></b>
  </xsl:template>
  <xsl:template match="author[1]">
    by <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="author">
    , <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
3. Either type this or download it from the web site for this book and save it to a file called transformLibrary.py:
#!/usr/bin/python
from Ft.Xml import InputSource
from Ft.Xml.Xslt.Processor import Processor

#Open the XML and stylesheet as streams
xml = open('library.xml')
xsl = open('HTMLLibrary.xsl')

#Parse the streams and build input sources from them
parsedxml = InputSource.DefaultFactory.fromStream(xml, "library.xml")
parsedxsl = InputSource.DefaultFactory.fromStream(xsl, "HTMLLibrary.xsl")

#Create a new processor and attach stylesheet, then transform XML
processor = Processor()
processor.appendStylesheet(parsedxsl)
HTML = processor.run(parsedxml)

#Write HTML out to a file
output = open("library.html", 'w')
output.write(HTML)
output.close()
4. Run python transformLibrary.py from the command line. This will create library.html.
5. Open library.html in a browser or text editor and look at the resulting web page.
How It Works
The first element of the stylesheet, <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">, declares the document to be an XSL stylesheet that conforms to the specification at http://www.w3.org/1999/XSL/Transform and associates the xsl: prefix with that URI.
Each xsl:template element is triggered whenever a node that matches a certain XPath is encountered. For instance, <xsl:template match="author[1]"> is triggered every time an <author> node is found that is the first <author> child of its parent element.
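The difference between author[1] and plain author can be tried out without an XSLT engine. The sketch below is written for Python 3 and uses xml.etree.ElementTree, which supports only a limited subset of XPath and is a stand-in here, not part of the chapter's 4Suite example. The sample data is hypothetical.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    "<book><author>A. Writer</author><author>B. Ellison</author></book>"
    "<book><author>C. Scribe</author></book>"
    "</library>"
)

# author[1] matches an <author> that is the first author child of its parent.
first_authors = [a.text for a in root.findall("book/author[1]")]

# A plain author step matches every author child.
all_authors = [a.text for a in root.findall("book/author")]

print(first_authors)
print(all_authors)
```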
XML tags that don’t start with the xsl: prefix are not treated as instructions and are copied to the output, as is plain text in the body of a template. Therefore, the following template returns the skeleton of an HTML page, with a <head>, <body>, and an <h1> with the title of the library:
<xsl:template match="/library">
  <html>
  <head>
    <xsl:value-of select="@owner"/>'s Library
  </head>
  <body>
    <h1><xsl:value-of select="@owner"/>'s Library</h1>
    <xsl:apply-templates/>
  </body>
  </html>
</xsl:template>
The xsl:value-of element returns the text value of an XPath expression. If the XPath selects more than one node, XSLT 1.0 converts only the first node in document order to text and returns that. <xsl:value-of select="@owner"/>, for instance, will return the text value of the owner attribute on the current context node, which in this case is the <library> node. Because the attribute value is already a string, it will return John Q. Reader unchanged.
The xsl:apply-templates element is where the power of XSL lies. When used with no select attribute, it selects all child nodes of the current node, triggers the templates that match each of them, and then inserts the resulting nodes into the results of the current template. It can also be given a select attribute in the form of an XPath expression, in which case templates are applied only to the nodes selected.
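To make the recursion concrete, here is a rough Python 3 analogy of the template mechanism, using minidom and ordinary functions. It is not XSLT and says nothing about how 4Suite works internally; the process, apply_templates, and text_of names and the sample data are all hypothetical. Each branch of process plays the role of one template from HTMLLibrary.xsl, and apply_templates mimics <xsl:apply-templates/> with no select attribute.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data standing in for library.xml.
SAMPLE = (
    '<library owner="John Q. Reader">'
    '<book><title>Example Title</title>'
    '<author>A. Writer</author><author>B. Ellison</author></book>'
    '</library>'
)

def text_of(node):
    """Concatenate all text beneath node (the string value of an element)."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)

def apply_templates(node):
    """Process every child of node and join the results."""
    return "".join(process(child) for child in node.childNodes)

def process(node):
    """Choose the 'template' for a node by its tag name."""
    if node.nodeType == node.TEXT_NODE:
        return node.data  # text is copied through unchanged
    if node.tagName == "library":
        owner = node.getAttribute("owner")
        return "<h1>%s's Library</h1>%s" % (owner, apply_templates(node))
    if node.tagName == "book":
        return apply_templates(node) + "<br/>"
    if node.tagName == "title":
        return "<b>%s</b>" % text_of(node)
    if node.tagName == "author":
        # The first author gets " by "; later ones get ", ".
        first = node.parentNode.getElementsByTagName("author")[0] is node
        return (" by " if first else ", ") + text_of(node)
    return apply_templates(node)  # unknown element: keep descending

html = process(parseString(SAMPLE).documentElement)
print(html)
```

Each function here returns a string built only from its own node and the results of its children, which mirrors the point that a template should not depend on what other templates have done.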
Putting It All Together: Working with RSS
Now that you’ve learned how to work with XML in Python, it’s time for a real-world example that shows you how you might want to use these modules to create your own RSS feed and how to take an RSS feed and turn it into a web page for reading.
RSS Overview and Vocabulary
Depending on who you ask, RSS stands for Really Simple Syndication, or Rich Site Summary, or RDF Site Summary. Regardless of what you want to call it, RSS is an XML-based format for syndicating content from news sites, blogs, and anyone else who wants to share discrete chunks of information over time. RSS’s usefulness lies in the ease with which content can be aggregated and republished. RSS makes it possible to read all your favorite authors’ blogs on a single web page, or, for example, to see every article from a news agency containing the word “Tanzania” first thing every day.
RSS originally started as part of Netscape’s news portal and has gone through several versions since then. After Netscape dropped development on RSS and released it to the public, two different groups began developing along what they each felt was the correct path for RSS to take. At present, one group has released a format they are calling RSS 1.0, and the other has released a format they are calling 2.0, despite the fact that 2.0 is not a successor to 1.0. At this point, RSS refers to seven different and sometimes incompatible formats, which can lead to a great deal of confusion for the newcomer to RSS.
Making Sense of It All
The following table summarizes the existing versions of RSS. As a content producer, the choice of version is fairly simple, but an RSS aggregator, which takes content from multiple sites and displays it in a single feed, has to handle all seven formats.
Version   Owner                    Notes
0.90      Netscape                 The original format. Netscape decided this format was overly complex and began work on 0.91 before dropping RSS development. Obsoleted by 1.0.
0.91      Userland                 Partially developed by Netscape before being picked up by Userland. Incredibly simple and still very popular, although officially obsoleted by 2.0.
0.92      Userland                 More complex than 0.91. Obsoleted by 2.0.
0.93      Userland                 More complex than 0.91. Obsoleted by 2.0.
0.94      Userland                 More complex than 0.91. Obsoleted by 2.0.
1.0       RSS-DEV Working Group    RDF-based. Stable, but with modules still under development. Successor to 0.90.
2.0       Userland                 Does not use RDF. Successor to 0.94. Stable, with modules still under development.
RSS Vocabulary
RSS feeds are composed of documents called channels, which are feeds from a single web site. Each channel has a title, a link to the originating web site, a description, and a language. It also contains one or more items, which contain the actual content of the feed. An item must also have a title, a description, and a unique link back to the originating web site.
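The channel and item structure described above can be sketched with minidom. This Python 3 example is illustrative only: the build_channel helper is a hypothetical name, the en-us language code is an assumed value, and the result has not been checked against the RSS 0.91 DTD.

```python
from xml.dom.minidom import Document

def build_channel(title, link, description, language, items):
    """Return a Document holding one <channel> with the given items."""
    doc = Document()
    rss = doc.createElement("rss")
    rss.setAttribute("version", "0.91")
    doc.appendChild(rss)
    channel = doc.createElement("channel")
    rss.appendChild(channel)

    def add_text(parent, tag, text):
        # Create <tag>text</tag> and attach it to parent.
        node = doc.createElement(tag)
        node.appendChild(doc.createTextNode(text))
        parent.appendChild(node)

    # Channel-level fields: title, link, description, and language.
    for tag, text in (("title", title), ("link", link),
                      ("description", description), ("language", language)):
        add_text(channel, tag, text)

    # Each item carries its own title, link, and description.
    for item in items:
        item_node = doc.createElement("item")
        channel.appendChild(item_node)
        for tag in ("title", "link", "description"):
            add_text(item_node, tag, item[tag])
    return doc

feed = build_channel(
    "My Daily Blog",
    "http://server.mydomain.tld",
    "This is my blog.",
    "en-us",  # assumed language code
    [{"title": "Blogging more popular than ever",
      "link": "http://server.mydomain.tld/myblog.html#autogen1",
      "description": "More people are blogging now than ever before."}],
)
print(feed.toprettyxml("  "))
```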
RSS 1.0 adds optional elements for richer content syndication, such as images, and a text input element for submitting information back to the parent site.
An RSS DTD
The DTD Netscape released for RSS 0.91 is freely available at http://my.netscape.com/publish/formats/rss-0.91.dtd. It’s the simplest of the RSS document models, and it’s the one that will be used in the RSS examples in this chapter. To use it, include a DTD reference to that URI at the top of your XML file.
A Real-World Problem
With the increasing popularity of blogging, fueled by easy-to-use tools like Blogger and Movable Type, it would be nice to be able to syndicate your blog, so that other people could aggregate your posts into their portal pages. To do this, you’d like a script that reads your blog and turns it into an RSS feed to which other people can then subscribe.
Try It Out
Creating an RSS Feed
1. Either download the following from the web site for this book, or type it into a file called myblog.html:
<html>
<head>
  <title>My Daily Blog</title>
</head>
<body>
  <h1>My Daily Blog</h1>
  <p>This blog contains musings and news</p>
  <div class="story">
    <a name="autogen4"/>
    <h2>Really Big Corp to buy Slightly Smaller Corp</h2>
    <div class="date">10:00 PM, 1/1/2005</div>
    <span class="content">
    Really Big Corp announced its intent today to buy Slightly Smaller Corp. Slightly Smaller Corp is the world's foremost producer of lime green widgets. This will clearly impact the world's widget supply.
    </span>
  </div>
  <div class="story">
    <a name="autogen3"/>
    <h2>Python Code now easier than ever</h2>
    <div class="date">6:00 PM, 1/1/2005</div>
    <span class="content">
    Writing Python has become easier than ever with the release of the new book, Beginning Python, from Wrox Press.
    </span>
  </div>
  <div class="story">
    <a name="autogen2"/>
    <h2>Really Famous Author to speak at quirky little bookstore</h2>
    <div class="date">10:00 AM, 1/1/2005</div>
    <span class="content">
    A really good author will be speaking tomorrow night at a charming little bookstore in my home town. It's a can't miss event.
    </span>
  </div>
  <div class="story">
    <a name="autogen1"/>
    <h2>Blogging more popular than ever</h2>
    <div class="date">2:00 AM, 1/1/2005</div>
    <span class="content">
    More people are blogging now than ever before, leading to an explosion of opinions and timely content on the internet. It's hard to say if this is good or bad, but it's certainly a new method of communication.
    </span>
  </div>
</body>
</html>
2. Type or download the following XSLT from the web site for this book into a file called HTML2RSS.xsl:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"
      doctype-system="http://my.netscape.com/publish/formats/rss-0.91.dtd"
      doctype-public="-//Netscape Communications//DTD RSS 0.91//EN"/>
  <xsl:template match="/">
    <rss version="0.91">
      <channel>
        <xsl:apply-templates select="html/head/title"/>
        <link>http://server.mydomain.tld</link>
        <description>This is my blog. There are others like it, but this one is mine.</description>
        <xsl:apply-templates select="html/body/div[@class='story']"/>
      </channel>
    </rss>
  </xsl:template>
  <xsl:template match="head/title">
    <title>
      <xsl:apply-templates/>
    </title>
  </xsl:template>
  <xsl:template match="div[@class='story']">
    <item>
      <xsl:apply-templates/>
      <link>
        http://server.mydomain.tld/myblog.html#<xsl:value-of select="a/@name"/>
      </link>
    </item>
  </xsl:template>
  <xsl:template match="h2">
    <title><xsl:apply-templates/></title>
  </xsl:template>
  <xsl:template match="div[@class='story']/span[@class='content']">
    <description>
      <xsl:apply-templates/>
    </description>
  </xsl:template>
  <xsl:template match="div[@class='date']"/>
</xsl:stylesheet>
3. The same instructions go for this file: either type it in, or download it from the web site for the book into a file called HTML2RSS.py:
#!/usr/bin/python
from Ft.Xml import InputSource
from Ft.Xml.Xslt.Processor import Processor
from xml.parsers.xmlproc import xmlval

class docErrorHandler(xmlval.ErrorHandler):
    def warning(self, message):
        print message

    def error(self, message):
        print message

    def fatal(self, message):
        print message

#Open the HTML page and the stylesheet as streams
html = open('myblog.html')
xsl = open('HTML2RSS.xsl')

#Parse the streams and build input sources from them
parsedxml = InputSource.DefaultFactory.fromStream(html, "myblog.html")
parsedxsl = InputSource.DefaultFactory.fromStream(xsl, "HTML2RSS.xsl")

#Create a new processor and attach stylesheet, then transform XML
processor = Processor()
processor.appendStylesheet(parsedxsl)
HTML = processor.run(parsedxml)

#Write RSS out to a file
output = open("rssfeed.xml", 'w')
output.write(HTML)
output.close()

#validate the RSS produced
parser = xmlval.XMLValidator()
parser.set_error_handler(docErrorHandler(parser))
parser.parse_resource("rssfeed.xml")
How It Works
As in the previous XSLT example, this example opens a document and an XSLT, creates a processor, and uses the processor to run the XSLT on the source document. This case is slightly different, however, because the document being transformed is HTML. Because the page is well-formed XHTML, it can be transformed just like any other kind of XML.
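The point that well-formed XHTML can be handled like any other XML can be checked with the standard library alone. The sketch below is written for Python 3 and swaps XSLT for plain minidom calls, so it is a different technique from the 4Suite listing. The shortened PAGE sample and the stories helper are assumptions made for illustration.

```python
from xml.dom.minidom import parseString

# A shortened, hypothetical stand-in for myblog.html.
PAGE = (
    '<html><body>'
    '<div class="story"><a name="autogen2"/><h2>Second story</h2>'
    '<div class="date">10:00 AM, 1/1/2005</div>'
    '<span class="content"> Body of the second story. </span></div>'
    '<div class="story"><a name="autogen1"/><h2>First story</h2>'
    '<div class="date">2:00 AM, 1/1/2005</div>'
    '<span class="content"> Body of the first story. </span></div>'
    '</body></html>'
)

def stories(page_text, base="http://server.mydomain.tld/myblog.html"):
    """Return one dict per story with the fields an RSS item needs."""
    doc = parseString(page_text)
    found = []
    for div in doc.getElementsByTagName("div"):
        # Skip the nested date divs; only class="story" marks an item.
        if div.getAttribute("class") != "story":
            continue
        anchor = div.getElementsByTagName("a")[0].getAttribute("name")
        title = div.getElementsByTagName("h2")[0].childNodes[0].data
        content = div.getElementsByTagName("span")[0].childNodes[0].data
        found.append({
            "title": title,
            "link": base + "#" + anchor,
            "description": content.strip(),
        })
    return found

for story in stories(PAGE):
    print(story["title"], "->", story["link"])
```

This works only because the page is well-formed; ordinary HTML with unclosed tags would make parseString raise an error.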
Creating the Document
There’s an additional line in the XSL this time, one that reads <xsl:output method="xml" doctype-system="http://my.netscape.com/publish/formats/rss-0.91.dtd" doctype-public="-//Netscape Communications//DTD RSS 0.91//EN"/>. The xsl:output element is used to control the format of the output document. It can be used to output HTML instead of XML, and it can also be used to set the doctype of the resulting document. In this case, the doctype is being set to