
Beginning Python (2005)


Using Python for XML

#Get all the book elements in the library
books = myLibrary.getElementsByTagName("book")

#Print each book's title and author(s)
printLibrary(myLibrary)

#Insert a new book in the library
newBook = myDoc.createElement("book")

newBookTitle = myDoc.createElement("title")
titleText = myDoc.createTextNode("Beginning Python")
newBookTitle.appendChild(titleText)
newBook.appendChild(newBookTitle)

newBookAuthor = myDoc.createElement("author")
authorName = myDoc.createTextNode("Peter Norton, et al")
newBookAuthor.appendChild(authorName)
newBook.appendChild(newBookAuthor)

myLibrary.appendChild(newBook)

print "Added a new book!"
printLibrary(myLibrary)

#Remove a book from the library
#Find ellison book
for book in myLibrary.getElementsByTagName("book"):
    for author in book.getElementsByTagName("author"):
        if author.childNodes[0].data.find("Ellison") != -1:
            removedBook = myLibrary.removeChild(book)
            removedBook.unlink()

print "Removed a book."
printLibrary(myLibrary)

#Write back to the library file
lib = open("library.xml", 'w')
lib.write(myDoc.toprettyxml(" "))
lib.close()

3. Run the file with python xml_minidom.py.

How It Works

To create a DOM, the document needs to be parsed into a document tree. This is accomplished by calling the parse function from xml.dom.minidom, which returns a Document object. That object contains methods for querying for child nodes, getting all nodes in the document of a certain name, and creating new nodes, among other things. The getElementsByTagName method returns a list of Node objects whose names match the argument; here it is used to extract the root node of the document, the <library> node. The printLibrary function uses getElementsByTagName again and then, for each book node, prints the title and author. An element that contains only text is considered to have a single child node, a text node, and the text is stored in the data attribute of that node. So book.getElementsByTagName("title")[0].childNodes[0].data simply retrieves the text node below the <title> element and returns its data as a string.
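The lookup described above can be sketched as follows. This is only a minimal illustration: it parses a small made-up document from a string instead of reading library.xml, and it uses Python 3 print() calls, whereas the listings in this chapter are written for Python 2.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data; the chapter's real library.xml is not reproduced here.
SAMPLE = (
    '<library owner="John Q. Reader">'
    "<book><title>Sample Title</title><author>Sample Author</author></book>"
    "</library>"
)

doc = parseString(SAMPLE)

# getElementsByTagName returns a list-like NodeList; take the first match.
library = doc.getElementsByTagName("library")[0]

titles = []
for book in library.getElementsByTagName("book"):
    title_element = book.getElementsByTagName("title")[0]
    # The element's only child is a text node; its text lives in .data.
    titles.append(title_element.childNodes[0].data)

print(titles)
```

Indexing childNodes[0] works here because each <title> holds nothing but text; an element with mixed content would need a loop over its children.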


Chapter 15

Constructing a new node in DOM requires creating a new node as a piece of the Document object, adding all necessary attributes and child nodes, and then attaching it to the correct node in the document tree. The createElement(tagName) method of the Document object creates a new node with a tag name set to whatever argument has been passed in. Adding text nodes is accomplished almost the same way, with a call to createTextNode(string). When all the nodes have been created, the structure is created by calling the appendChild method of the node to which the newly created node will be attached. Node also has a method called insertBefore(newChild, refChild) for inserting nodes in an arbitrary location in the list of child nodes, and replaceChild(newChild, oldChild) to replace one node with another.
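To see the three attachment methods side by side, here is a small sketch under stated assumptions: the starting document and the make_book helper are invented for illustration, and the code is written for Python 3 rather than the Python 2 used in the chapter's listings.

```python
from xml.dom.minidom import parseString

# Hypothetical starting document containing a single book.
doc = parseString("<library><book><title>Old Title</title></book></library>")
library = doc.documentElement
first_book = library.getElementsByTagName("book")[0]


def make_book(document, title_text):
    """Hypothetical helper: build a detached <book><title>...</title></book>."""
    book = document.createElement("book")
    title = document.createElement("title")
    title.appendChild(document.createTextNode(title_text))
    book.appendChild(title)
    return book


# appendChild attaches the new node as the last child.
library.appendChild(make_book(doc, "Appended"))

# insertBefore(newChild, refChild) places the new node ahead of refChild.
library.insertBefore(make_book(doc, "Inserted First"), first_book)

# replaceChild(newChild, oldChild) swaps one node for another.
library.replaceChild(make_book(doc, "Replacement"), first_book)

order = [
    book.getElementsByTagName("title")[0].childNodes[0].data
    for book in library.getElementsByTagName("book")
]
print(order)
```

Note that every node is created through the Document object first and only becomes part of the tree once one of the three methods attaches it.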

Removing nodes requires first getting a reference to the node being removed and then calling removeChild(childNode) on its parent. After the child has been removed, it’s advisable to call unlink() on it, which breaks the internal references between the node and any children still attached so that their memory can be reclaimed. This method is specific to the minidom implementation and is not part of the generic xml.dom interface.
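A minimal sketch of removal, assuming a made-up two-book document (the author names are placeholders) and Python 3 syntax:

```python
from xml.dom.minidom import parseString

# Hypothetical sample data with one book to remove and one to keep.
doc = parseString(
    "<library>"
    "<book><author>A. Ellison</author></book>"
    "<book><author>Someone Else</author></book>"
    "</library>"
)
library = doc.documentElement

# getElementsByTagName returns a separate list, so removing nodes
# from the tree while looping over it is safe.
for book in library.getElementsByTagName("book"):
    authors = [a.childNodes[0].data for a in book.getElementsByTagName("author")]
    if any("Ellison" in name for name in authors):
        removed = library.removeChild(book)  # returns the detached node
        removed.unlink()  # drop internal references so memory can be reclaimed

remaining = len(library.getElementsByTagName("book"))
print(remaining)
```

Collecting the author names before calling unlink() avoids touching a node after its children have been detached.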

Finally, having made all these changes to the document, it would be useful to be able to write the DOM back to the file from which it came. xml.dom.minidom includes a utility method called toprettyxml, which takes optional arguments, including an indentation string and a newline string. If not specified, these default to a tab character and \n, respectively. This utility returns the DOM as a string of nicely indented XML and is just the thing for writing back to the file.
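The effect of the two arguments can be seen in a short sketch. The sample document is an assumption, and the code targets Python 3, where indent and newl are the keyword names for the indentation and newline strings.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document.
doc = parseString("<library><book><title>Sample</title></book></library>")

# Defaults: one tab per nesting level and "\n" between lines.
default_text = doc.toprettyxml()

# Two spaces per nesting level instead of a tab.
spaced_text = doc.toprettyxml(indent="  ", newl="\n")

print(spaced_text)
```

Because the method returns a string, writing the result back is an ordinary file write, as in the listing earlier in this section.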

Try It Out

Working with XML Using SAX

This example will show you how you can explore a document with SAX.

#!/usr/bin/python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

#begin bookHandler
class bookHandler(ContentHandler):
    inAuthor = False
    inTitle = False

    def startElement(self, name, attributes):
        if name == "book":
            print "*****Book*****"
        if name == "title":
            self.inTitle = True
            print "Title: ",
        if name == "author":
            self.inAuthor = True
            print "Author: ",

    def endElement(self, name):
        if name == "title":
            self.inTitle = False
        if name == "author":
            self.inAuthor = False

    def characters(self, content):
        if self.inTitle or self.inAuthor:
            print content
#end bookHandler

parser = make_parser()
parser.setContentHandler(bookHandler())
parser.parse("library.xml")

How It Works

The xml.sax parser uses handler objects to deal with events that occur during the parsing of a document. A handler may be a ContentHandler, a DTDHandler, an EntityResolver for handling entity references, or an ErrorHandler. A SAX application implements handler classes that conform to these interfaces and then registers them with the parser.

The ContentHandler interface contains methods that are triggered by document events, such as the start and end of elements and character data. When parsing character data, the parser has the option of returning it in one large block or in several smaller chunks, so the characters method may be called repeatedly for a single run of text.
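Because characters may fire more than once per run of text, a handler that needs the complete string has to accumulate the pieces. The following is a sketch under assumptions: TitleCollector and the sample document are invented for illustration, and the code uses Python 3.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class TitleCollector(ContentHandler):
    """Hypothetical handler that collects the full text of each <title>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._chunks = None  # None means "not inside a <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._chunks = []

    def characters(self, content):
        # May be called several times for one run of text, so accumulate.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._chunks))
            self._chunks = None


handler = TitleCollector()
parseString(
    b"<library><book><title>Sample &amp; Title</title></book></library>",
    handler,
)
print(handler.titles)
```

Joining the chunks in endElement means the result is the same however the parser chooses to split the text.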

The make_parser function creates a new parser object and returns it. The parser object created will be of the first parser type the system finds. make_parser can also take an optional argument consisting of a list of module names to try, each of which must provide a create_parser function. If a list is supplied, those parsers will be tried before the default list of parsers.
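As a brief sketch of the optional argument, the call below names xml.sax.expatreader, the standard-library module that supplies a create_parser function. The ElementCounter handler and the in-memory document are assumptions made for the example, which is written for Python 3.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


# Modules named in the list are tried before the default parsers.
parser = make_parser(["xml.sax.expatreader"])
counter = ElementCounter()
parser.setContentHandler(counter)

# parse accepts a file name or a file-like object.
parser.parse(io.BytesIO(b"<library><book/><book/></library>"))
print(counter.count)
```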

Intro to XSLT

XSLT stands for Extensible Stylesheet Language Transformations. Used for transforming XML into output formats such as HTML, it is a functional, template-driven language.

XSLT Is XML

Like a Schema, an XSLT stylesheet is itself written in XML, and it’s being used to supplement the capabilities of XML. The XSLT namespace is "http://www.w3.org/1999/XSL/Transform", which identifies the elements that make up the language. An XSLT stylesheet can be validated, like all other XML.

Transformation and Formatting Language

XSLT is used to transform one XML syntax into another or into any other text-based format. It is often used to transform XML into HTML in preparation for web presentation or a custom document model into XSL-FO for conversion into PDF.

Functional, Template-Driven

XSLT is a functional language, much like LISP. The XSLT programmer declares a series of templates, which are functions triggered when a node in the document matches an XPath expression. The programmer cannot guarantee the order of execution, so each function must stand on its own and make no assumptions about the results of other functions.

Using Python to Transform XML Using XSLT

Python doesn’t directly supply a way to create an XSLT, unfortunately. To transform XML documents, an XSLT must be created, and then it can be applied via Python to the XML.

In addition, Python’s core libraries don’t supply a method for transforming XML via XSLT, but a couple of different options are available from other libraries. Fourthought, Inc., offers an XSLT engine as part of its freely available 4Suite package. There are also Python bindings for the widely popular libxslt C library.

The following example uses the latest version of the 4Suite library, which, as of this writing, is 1.0a4. If you don’t have the 4Suite library installed, please download it from http://4suite.org/index.xhtml. You will need it to complete the following exercises.

Try It Out

Transforming XML with XSLT

1. If you haven’t already, save the example XML file from the beginning of this chapter to a file called library.xml.

2. Cut and paste the following XSL from the wrox.com web site into a file called HTMLLibrary.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/library">
    <html>
      <head>
        <xsl:value-of select="@owner"/>'s Library
      </head>
      <body>
        <h1><xsl:value-of select="@owner"/>'s Library</h1>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="book">
    <xsl:apply-templates/>
    <br/>
  </xsl:template>

  <xsl:template match="title">
    <b><xsl:value-of select="."/></b>
  </xsl:template>

  <xsl:template match="author[1]">
    by <xsl:value-of select="."/>
  </xsl:template>

  <xsl:template match="author">
    , <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>

3. Either type this or download it from the web site for this book and save it to a file called transformLibrary.py:

#!/usr/bin/python
from Ft.Xml import InputSource
from Ft.Xml.Xslt.Processor import Processor

#Open the XML and stylesheet as streams
xml = open('library.xml')
xsl = open('HTMLLibrary.xsl')

#Parse the streams and build input sources from them
parsedxml = InputSource.DefaultFactory.fromStream(xml, "library.xml")
parsedxsl = InputSource.DefaultFactory.fromStream(xsl, "HTMLLibrary.xsl")

#Create a new processor and attach stylesheet, then transform XML
processor = Processor()
processor.appendStylesheet(parsedxsl)
HTML = processor.run(parsedxml)

#Write HTML out to a file
output = open("library.html", 'w')
output.write(HTML)
output.close()

4. Run python transformLibrary.py from the command line. This will create library.html.

5. Open library.html in a browser or text editor and look at the resulting web page.

How It Works

The opening element of the stylesheet, <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">, declares the document to be an XSL stylesheet that conforms to the specification at http://www.w3.org/1999/XSL/Transform and associates the xsl: prefix with that URI.

Each xsl:template element is triggered whenever a node that matches a certain XPath is encountered. For instance, <xsl:template match="author[1]"> is triggered every time an <author> node is found that is the first in a list of authors.

XML tags that don’t start with the xsl: prefix are not interpreted as instructions and are copied to the output, as is plain text in the body of a template. Therefore, the following template returns the skeleton of an HTML page, with a <head>, <body>, and an <h1> with the title of the library:

<xsl:template match="/library">
  <html>
    <head>
      <xsl:value-of select="@owner"/>'s Library
    </head>
    <body>
      <h1><xsl:value-of select="@owner"/>'s Library</h1>
      <xsl:apply-templates/>
    </body>
  </html>
</xsl:template>

The xsl:value-of element returns the text value of an XPath expression. If the XPath selects more than one node, XSLT 1.0 uses only the first node in document order and returns its string value. <xsl:value-of select="@owner"/>, for instance, will return the text value of the owner attribute on the current context node, which in this case is the <library> node. Because the attribute is a string, it will return John Q. Reader unchanged.

The xsl:apply-templates element is where much of the power of XSL lies. When called with no select attribute, it selects all child nodes of the current node, triggers the templates that match each of them, and then inserts the resulting nodes into the results of the current template. It can also be called with a select attribute in the form of an XPath expression, in which case templates are applied only to the nodes selected.
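For readers who think more easily in Python than in XSL, the recursion behind apply-templates can be imitated with the standard library alone. This is only a rough analogy and not how an XSLT engine works internally: the dispatch, apply_templates, and text_of functions and the sample document are all invented for illustration, and the code is written for Python 3.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data in the shape of the chapter's library document.
doc = parseString(
    '<library owner="John Q. Reader">'
    "<book><title>Sample Title</title><author>First Author</author>"
    "<author>Second Author</author></book>"
    "</library>"
)


def text_of(node):
    """Concatenate the text nodes directly under an element."""
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)


def apply_templates(node):
    """Stand-in for <xsl:apply-templates/>: process each child in turn."""
    return "".join(dispatch(child) for child in node.childNodes)


def dispatch(node):
    """Pick the 'template' for a node, loosely mirroring HTMLLibrary.xsl."""
    if node.nodeType == node.TEXT_NODE:
        return node.data  # text is copied through
    if node.tagName == "library":
        owner = node.getAttribute("owner")
        return "<h1>%s's Library</h1>%s" % (owner, apply_templates(node))
    if node.tagName == "book":
        return apply_templates(node) + "<br/>"
    if node.tagName == "title":
        return "<b>%s</b>" % text_of(node)
    if node.tagName == "author":
        # Imitates the author[1] template versus the general author template.
        is_first = node.parentNode.getElementsByTagName("author")[0] is node
        return (" by " if is_first else ", ") + text_of(node)
    return apply_templates(node)


html = dispatch(doc.documentElement)
print(html)
```

Each branch returns its own fragment and delegates to apply_templates for its children, which is the same division of labor the stylesheet's templates follow.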

Putting It All Together: Working with RSS

Now that you’ve learned how to work with XML in Python, it’s time for a real-world example that shows you how you might want to use these modules to create your own RSS feed and how to take an RSS feed and turn it into a web page for reading.

RSS Overview and Vocabulary

Depending on who you ask, RSS stands for Really Simple Syndication, or Rich Site Summary, or RDF Site Summary. Regardless of what you want to call it, RSS is an XML-based format for syndicating content from news sites, blogs, and anyone else who wants to share discrete chunks of information over time. RSS’s usefulness lies in the ease with which content can be aggregated and republished. RSS makes it possible to read all your favorite authors’ blogs on a single web page, or, for example, to see every article from a news agency containing the word “Tanzania” first thing every day.

RSS originally started as part of Netscape’s news portal and has gone through several versions since then. After Netscape dropped development on RSS and released it to the public, two different groups began developing along what they each felt was the correct path for RSS to take. At present, one group has released a format they are calling RSS 1.0, and the other has released a format they are calling 2.0, despite the fact that 2.0 is not a successor to 1.0. At this point, RSS refers to seven different and sometimes incompatible formats, which can lead to a great deal of confusion for the newcomer to RSS.

Making Sense of It All

The following table summarizes the existing versions of RSS. As a content producer, the choice of version is fairly simple, but an RSS aggregator, which takes content from multiple sites and displays it in a single feed, has to handle all seven formats.

Version  Owner                  Notes
0.90     Netscape               The original format. Netscape decided this format was overly complex and began work on 0.91 before dropping RSS development. Obsolete by 1.0.
0.91     Userland               Partially developed by Netscape before being picked up by Userland. Incredibly simple and still very popular, although officially obsolete by 2.0.
0.92     Userland               More complex than .91. Obsolete by 2.0.
0.93     Userland               More complex than .91. Obsolete by 2.0.
0.94     Userland               More complex than .91. Obsolete by 2.0.
1.0      RSS-DEV Working Group  RDF-based. Stable, but with modules still under development. Successor to 0.90.
2.0      Userland               Does not use RDF. Successor to 0.94. Stable, with modules still under development.

RSS Vocabulary

RSS feeds are composed of documents called channels, which are feeds from a single web site. Each channel has a title, a link to the originating web site, a description, and a language. It also contains one or more items, which contain the actual content of the feed. An item must also have a title, a description, and a unique link back to the originating web site.

RSS 1.0 adds optional elements for richer content syndication, such as images, and a text input element for submitting information back to the parent site.
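To make the vocabulary concrete, here is a sketch that assembles a one-item channel with xml.dom.minidom. The add_text_element helper, the language code, and all of the text values are placeholders chosen for illustration, and the code is written for Python 3.

```python
from xml.dom.minidom import getDOMImplementation


def add_text_element(document, parent, tag, text):
    """Hypothetical helper: append <tag>text</tag> to parent."""
    element = document.createElement(tag)
    element.appendChild(document.createTextNode(text))
    parent.appendChild(element)
    return element


doc = getDOMImplementation().createDocument(None, "rss", None)
rss = doc.documentElement
rss.setAttribute("version", "0.91")

# A channel carries a title, link, description, and language.
channel = doc.createElement("channel")
rss.appendChild(channel)
add_text_element(doc, channel, "title", "My Daily Blog")
add_text_element(doc, channel, "link", "http://server.mydomain.tld")
add_text_element(doc, channel, "description", "Musings and news")
add_text_element(doc, channel, "language", "en-us")

# Each item has its own title, link, and description.
item = doc.createElement("item")
channel.appendChild(item)
add_text_element(doc, item, "title", "Blogging more popular than ever")
add_text_element(doc, item, "link", "http://server.mydomain.tld/myblog.html#autogen1")
add_text_element(doc, item, "description", "More people are blogging now than ever before.")

feed_xml = doc.toprettyxml(indent="  ")
print(feed_xml)
```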

An RSS DTD

The DTD Netscape released for RSS 0.91 is freely available at http://my.netscape.com/publish/formats/rss-0.91.dtd. It’s the simplest of the RSS document models, and it’s the one that will be used in the RSS examples in this chapter. To use it, include a DTD reference to that URI at the top of your XML file.
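One way to produce such a DTD reference from Python is to create the document with a doctype node. The sketch below uses xml.dom.minidom under Python 3; the public identifier is the one that appears in the HTML2RSS.xsl stylesheet in this chapter, and the otherwise empty rss element is just a stand-in for a real feed.

```python
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()

# createDocumentType(qualifiedName, publicId, systemId)
doctype = impl.createDocumentType(
    "rss",
    "-//Netscape Communications//DTD RSS 0.91//EN",
    "http://my.netscape.com/publish/formats/rss-0.91.dtd",
)

# Passing the doctype to createDocument places it ahead of the root element.
doc = impl.createDocument(None, "rss", doctype)
doc.documentElement.setAttribute("version", "0.91")

feed_xml = doc.toxml()
print(feed_xml)
```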

A Real-World Problem

With the increasing popularity of blogging, fueled by easy-to-use tools like Blogger and Movable Type, it would be nice to be able to syndicate your blog so that other people could aggregate your posts into their portal pages. To do this, you’d like a script that reads your blog and turns it into an RSS feed to which other people can then subscribe.


Try It Out

Creating an RSS Feed

1. Either download the following from the web site for this book, or type it into a file called myblog.html:

<html>
  <head>
    <title>My Daily Blog</title>
  </head>
  <body>
    <h1>My Daily Blog</h1>
    <p>This blog contains musings and news</p>
    <div class="story">
      <a name="autogen4"/>
      <h2>Really Big Corp to buy Slightly Smaller Corp</h2>
      <div class="date">10:00 PM, 1/1/2005</div>
      <span class="content">
        Really Big Corp announced its intent today to buy Slightly Smaller Corp. Slightly Smaller Corp is the world's foremost producer of lime green widgets. This will clearly impact the world's widget supply.
      </span>
    </div>
    <div class="story">
      <a name="autogen3"/>
      <h2>Python Code now easier than ever</h2>
      <div class="date">6:00 PM, 1/1/2005</div>
      <span class="content">
        Writing Python has become easier than ever with the release of the new book, Beginning Python, from Wrox Press.
      </span>
    </div>
    <div class="story">
      <a name="autogen2"/>
      <h2>Really Famous Author to speak at quirky little bookstore</h2>
      <div class="date">10:00 AM, 1/1/2005</div>
      <span class="content">
        A really good author will be speaking tomorrow night at a charming little bookstore in my home town. It's a can't miss event.
      </span>
    </div>
    <div class="story">
      <a name="autogen1"/>
      <h2>Blogging more popular than ever</h2>
      <div class="date">2:00 AM, 1/1/2005</div>
      <span class="content">
        More people are blogging now than ever before, leading to an explosion of opinions and timely content on the internet. It's hard to say if this is good or bad, but it's certainly a new method of communication.
      </span>
    </div>
  </body>
</html>


2. Type or download the following XSLT from the web site for this book into a file called HTML2RSS.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"
      doctype-system="http://my.netscape.com/publish/formats/rss-0.91.dtd"
      doctype-public="-//Netscape Communications//DTD RSS 0.91//EN"/>

  <xsl:template match="/">
    <rss version="0.91">
      <channel>
        <xsl:apply-templates select="html/head/title"/>
        <link>http://server.mydomain.tld</link>
        <description>This is my blog. There are others like it, but this one is mine.</description>
        <xsl:apply-templates select="html/body/div[@class='story']"/>
      </channel>
    </rss>
  </xsl:template>

  <xsl:template match="head/title">
    <title>
      <xsl:apply-templates/>
    </title>
  </xsl:template>

  <xsl:template match="div[@class='story']">
    <item>
      <xsl:apply-templates/>
      <link>
        http://server.mydomain.tld/myblog.html#<xsl:value-of select="a/@name"/>
      </link>
    </item>
  </xsl:template>

  <xsl:template match="h2">
    <title><xsl:apply-templates/></title>
  </xsl:template>

  <xsl:template match="div[@class='story']/span[@class='content']">
    <description>
      <xsl:apply-templates/>
    </description>
  </xsl:template>

  <xsl:template match="div[@class='date']"/>
</xsl:stylesheet>


3. The same instructions apply to this file: either type it in, or download it from the web site for the book, and save it as HTML2RSS.py:

#!/usr/bin/python
from Ft.Xml import InputSource
from Ft.Xml.Xslt.Processor import Processor
from xml.parsers.xmlproc import xmlval

class docErrorHandler(xmlval.ErrorHandler):
    def warning(self, message):
        print message

    def error(self, message):
        print message

    def fatal(self, message):
        print message

#Open the HTML and the stylesheet as streams
html = open('myblog.html')
xsl = open('HTML2RSS.xsl')

#Parse the streams and build input sources from them
parsedxml = InputSource.DefaultFactory.fromStream(html, "myblog.html")
parsedxsl = InputSource.DefaultFactory.fromStream(xsl, "HTML2RSS.xsl")

#Create a new processor and attach stylesheet, then transform XML
processor = Processor()
processor.appendStylesheet(parsedxsl)
HTML = processor.run(parsedxml)

#Write RSS out to a file
output = open("rssfeed.xml", 'w')
output.write(HTML)
output.close()

#validate the RSS produced
parser = xmlval.XMLValidator()
parser.set_error_handler(docErrorHandler(parser))
parser.parse_resource("rssfeed.xml")

How It Works

Like the previous XSLT example, this example opens a document and an XSLT, creates a processor, and uses the processor to run the XSLT on the source document. The difference is that the document being transformed is HTML. Because the page is well-formed XHTML, it can be transformed just like any other kind of XML.
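The point that XHTML can be handled like any other XML is easy to check with the standard library. The sketch below is not part of the 4Suite example: it parses a trimmed-down, single-story version of the blog page with xml.dom.minidom under Python 3 and pulls out the anchor name and headline that the stylesheet turns into an item's link and title.

```python
from xml.dom.minidom import parseString

# A cut-down stand-in for myblog.html with one story.
PAGE = (
    "<html><body>"
    '<div class="story"><a name="autogen1"/>'
    "<h2>Blogging more popular than ever</h2>"
    '<div class="date">2:00 AM, 1/1/2005</div>'
    '<span class="content">More people are blogging now than ever before.</span>'
    "</div>"
    "</body></html>"
)

doc = parseString(PAGE)

stories = []
for div in doc.getElementsByTagName("div"):
    # Skip the nested date <div>; only story containers are of interest.
    if div.getAttribute("class") != "story":
        continue
    anchor = div.getElementsByTagName("a")[0].getAttribute("name")
    headline = div.getElementsByTagName("h2")[0].childNodes[0].data
    stories.append((anchor, headline))

print(stories)
```

This works only because the page is well-formed; ordinary HTML with unclosed tags would make the XML parser raise an error.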

Creating the Document

There’s an additional line in the XSL this time, one that reads <xsl:output method="xml" doctype-system="http://my.netscape.com/publish/formats/rss-0.91.dtd" doctype-public="-//Netscape Communications//DTD RSS 0.91//EN"/>. The xsl:output element is used to control the format of the output document. It can be used to output HTML instead of XML, and it can also be used to set the doctype of the resulting document. In this case, the doctype is being set to
