
Beginning Python (2005)


Using Python for XML

http://my.netscape.com/publish/formats/rss-0.91.dtd, which means the document can be validated after it’s produced to make sure that the resulting RSS is correct.

The stylesheet selects the title of the web page as the title of the RSS feed and creates a description for it, and then pulls story content from the body of the document. To make the example less complex, the HTML has been marked up with div tags to separate stories, but that isn’t strictly necessary.

Checking It Against the DTD

As in the earlier validation example, the script creates a validating parser and an ErrorHandler class. The result document already has the document type set, so all that’s required to validate it is to parse it with the validating parser and then print any errors encountered during validation.

Another Real-World Problem

Now that you’ve started publishing your own content, it would be nice to look at everyone else’s while you’re at it. If you built your own aggregator, then you could create a personalized web page of the news feeds you like to read.

Try It Out

Creating An Aggregator

1. Type or download the following into a file called RSS2HTML.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <html>
      <head>
        <title>My Personal News Feed</title>
      </head>
      <body>
        <h1>My Personal News Feed</h1>
        <xsl:apply-templates select="//channel/item[1]"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="item">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="title">
    <h2><xsl:value-of select="."/></h2>
  </xsl:template>

  <xsl:template match="description">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="link">
    <a>
      <xsl:attribute name="href">
        <xsl:value-of select="."/>
      </xsl:attribute>
      <xsl:value-of select="."/>
    </a>
  </xsl:template>

</xsl:stylesheet>

2. Download or type the following to a file called RSS2HTML.py:

#!/usr/bin/python

from Ft.Xml import InputSource
from Ft.Xml.Xslt.Processor import Processor

# Open the stylesheet as a stream
xsl = open('RSS2HTML.xsl')

# Parse the streams and build input sources from them
parsedxml = InputSource.DefaultFactory.fromUri(
    "http://www.newscientist.com/feed.ns?index=marsrovers&type=xml")
parsedxsl = InputSource.DefaultFactory.fromStream(xsl, "RSS2HTML.xsl")

# Create a new processor and attach stylesheet, then transform XML
processor = Processor()
processor.appendStylesheet(parsedxsl)
HTML = processor.run(parsedxml)

# Write HTML out to a file
output = open("aggregator.html", 'w')
output.write(HTML)
output.close()

3. Run python RSS2HTML.py. Then open aggregator.html in a browser or text editor and view the resulting HTML.

How It Works

The example RSS feed is a 0.91 RSS feed, for simplicity’s sake. Much like the example for using XSLTs, the Python script opens and parses a feed, parses the XSL to be applied, and then creates a processor and associates the stylesheet with it and processes the contents of the feed. In this case, however, the script is processing a feed from a URL using InputSource.DefaultFactory.fromUri. Fortunately, the module takes care of all of the details of getting the data from the remote server. You simply need to specify the URL for the feed and have a working Internet connection.
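If the 4Suite package that the script imports is not available in your environment, the same idea can be sketched with the standard library alone. The following is a minimal sketch, not the book's method: it swaps the XSLT transformation for plain Python using xml.etree.ElementTree, and it works on a small invented feed (SAMPLE_FEED and the function name first_item_to_html are made up for illustration) rather than downloading one.

```python
import html
import xml.etree.ElementTree as ET

# A small RSS 0.91 document standing in for a downloaded feed (invented content).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example Feed</title>
    <link>http://www.example.com/</link>
    <description>An invented feed used only for this sketch.</description>
    <item>
      <title>First story</title>
      <link>http://www.example.com/stories/1</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/stories/2</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>
"""


def first_item_to_html(feed_text):
    """Render the first item of each channel, roughly as RSS2HTML.xsl does."""
    root = ET.fromstring(feed_text)
    parts = [
        "<html><head><title>My Personal News Feed</title></head><body>",
        "<h1>My Personal News Feed</h1>",
    ]
    for channel in root.iter("channel"):
        item = channel.find("item")  # first <item> child, like item[1] in the XPath
        if item is None:
            continue
        title = html.escape(item.findtext("title", default=""))
        link = html.escape(item.findtext("link", default=""), quote=True)
        description = html.escape(item.findtext("description", default=""))
        parts.append("<h2>%s</h2>" % title)
        parts.append('<a href="%s">%s</a>' % (link, link))
        parts.append(description)
    parts.append("</body></html>")
    return "\n".join(parts)


page = first_item_to_html(SAMPLE_FEED)
print(page)
```

Only the first story appears in the output, because the stylesheet's select expression (and the find call here) picks just the first item in each channel.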


Summary

In this chapter, you’ve learned the following:

How to parse XML using both SAX and DOM

How to validate XML using xmlproc

How to transform XML with XSLT

How to parse HTML using either HTMLParser or htmllib

How to manipulate RSS using Python

In Chapter 16, you learn more about network programming and e-mail. Before proceeding, however, try the exercises that follow to test your understanding of the material covered in this chapter. You can find the solutions to these exercises in Appendix A.

Exercises

1. Given the following configuration file for a Python application, write some code to extract the configuration information using a DOM parser:

<?xml version="1.0"?>
<!DOCTYPE config SYSTEM "configfile.dtd">
<config>
  <utilitydirectory>/usr/bin</utilitydirectory>
  <utility>grep</utility>
  <mode>recursive</mode>
</config>

2. Given the following DTD, named configfile.dtd, write a Python script to validate the previous configuration file:

<!ELEMENT config (utilitydirectory, utility, mode)>
<!ELEMENT utilitydirectory (#PCDATA)*>
<!ELEMENT utility (#PCDATA)*>
<!ELEMENT mode (#PCDATA)*>

3. Use SAX to extract configuration information from the preceding config file instead of DOM.


16

Network Programming

For more than a decade as of this writing, one of the main reasons people buy personal computers has been the desire to get online: to connect in various ways to other computers throughout the world. Network connectivity (specifically, Internet connectivity) is the “killer app” for personal computing, the feature that got a computer-illiterate general population to start learning about and buying personal computers en masse.

Without networking, you can do amazing things with a computer, but your audience is limited to the people who can come over to look at your screen or who can read the printouts or load the CDs and floppy disks you distribute. Connect the same computer to the Internet and you can communicate across town or across the world.

The Internet’s architecture supports an unlimited number of applications, but it boasts two killer apps of its own: two applications that people get online just to use. One is, of course, the incredibly popular World Wide Web, which is covered in Chapter 21, “Web Applications and Web Services.”

The Internet’s other killer app is e-mail, which is covered in depth in this chapter. Here, you’ll use standard Python libraries to write applications that compose, send, and receive e-mail. Then, for those who dream of writing their own killer app, you’ll write some programs that use the Internet to send and receive data in custom formats.

Try It Out

Sending Some E-mail

Jamie Zawinski, one of the original Netscape programmers, has famously remarked, “Every program attempts to expand until it can read mail.” This may be true (it certainly was of the Netscape browser even early on when he worked on it), but long before your program becomes a mail reader, you’ll probably find that you need to make it send some mail. Mail readers are typically end-user applications, but nearly any kind of application can have a reason to send mail: monitoring software, automation scripts, web applications, even games. E-mail is the time-honored way of sending automatic notifications, and automatic notifications can happen in a wide variety of contexts.

Python provides a sophisticated set of classes for constructing e-mail messages, which are covered a bit later. Actually, an e-mail message is just a string in a predefined format. All you need to send an e-mail message is a string in that format, an address to send the mail to, and Python’s smtplib module. Here’s a very simple Python session that sends out a bare-bones e-mail message:

>>> fromAddress = 'sender@example.com'
>>> toAddress = 'me@my.domain'
>>> msg = "Subject: Hello\n\nThis is the body of the message."
>>> import smtplib
>>> server = smtplib.SMTP("localhost", 25)
>>> server.sendmail(fromAddress, toAddress, msg)
{}

smtplib takes its name from SMTP, the Simple Mail Transfer Protocol. That’s the protocol, or standard, defined for sending Internet mail. As you’ll see, Python comes packaged with modules that help you speak many Internet protocols, and the module is always named after the protocol: imaplib, poplib, httplib, ftplib, and so on.

Put your own e-mail address in place of me@my.domain, and if you’ve got a mail server running on your machine, you should be able to send mail to yourself, as shown in Figure 16-1.

Figure 16-1
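The session above was written for the Python of this book's era. As a rough sketch of how the same steps might look in current Python 3, the following wraps them in two functions. The names build_message and send_message are invented for illustration, and the From and To header lines are an addition that the original session does not include. The call that actually sends mail is left commented out, because it needs a reachable mail server.

```python
import smtplib


def build_message(from_address, to_address, subject, body):
    """Return a bare-bones message: header lines, a blank line, then the body."""
    headers = [
        "From: %s" % from_address,
        "To: %s" % to_address,
        "Subject: %s" % subject,
    ]
    # The blank line (two newlines in a row) separates headers from body.
    return "\n".join(headers) + "\n\n" + body


def send_message(host, port, from_address, to_address, message):
    """Send the message; the result is a dict of refused recipients."""
    # The with statement closes the connection even if sendmail raises.
    with smtplib.SMTP(host, port, timeout=10) as server:
        return server.sendmail(from_address, to_address, message)


msg = build_message("sender@example.com", "me@my.domain", "Hello",
                    "This is the body of the message.")
print(msg)

# Uncomment once a mail server is listening on this host and port:
# send_message("localhost", 25, "sender@example.com", "me@my.domain", msg)
```

Keeping message construction separate from sending makes it possible to inspect the string before any network connection is attempted.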

However, you probably don’t have a mail server running on your machine. (You might have one if you’re running these scripts on a shared computer, or if you set the mail server up yourself, in which case you probably already know a bit about networking and are impatiently waiting for the more advanced parts of this chapter.) If there’s no mail server on the machine where you run this script, you’ll get an exception when you try to instantiate the remote SMTP mail server object, something similar to this:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/smtplib.py", line 241, in __init__
    (code, msg) = self.connect(host, port)
  File "/usr/lib/python2.4/smtplib.py", line 303, in connect
    raise socket.error, msg
socket.error: (111, 'Connection refused')

What’s going on here? Look at the line that caused the exception:

>>> server = smtplib.SMTP("localhost", 25)

The constructor for the smtplib.SMTP class is trying to start up a network connection using IP, the Internet Protocol. The string “localhost” and the number 25 identify the Internet location of the putative mail server. Because you’re not running a mail server, there’s nothing at the other end of the connection, and when Python discovers this fact, it can’t continue.
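A script that sends notifications usually should not crash just because no mail server answers. The following is a minimal sketch of catching that failure, assuming current Python 3, where a refused connection surfaces as ConnectionRefusedError (a subclass of OSError) rather than the socket.error shown in the traceback. The helper names connect_or_none and unused_local_port are invented for this example.

```python
import smtplib
import socket


def connect_or_none(host, port, timeout=5):
    """Return a connected smtplib.SMTP object, or None if nothing answers."""
    try:
        return smtplib.SMTP(host, port, timeout=timeout)
    except OSError as error:
        # Refused connections and timeouts are both OSError subclasses,
        # and so are the exceptions that smtplib itself defines.
        print("Could not reach a mail server at %s:%d (%s)" % (host, port, error))
        return None


def unused_local_port():
    """Ask the operating system for a port that is free right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


# Nothing is listening on this port, so the attempt fails without a traceback.
server = connect_or_none("127.0.0.1", unused_local_port())
print(server)
```

Returning None is only one possible design; a longer-running program might retry later or queue the message instead.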


To understand the mystical meanings of “localhost” and 25, it helps to know a little about protocols, and the Internet Protocol in particular.

Understanding Protocols

A protocol is a convention for structuring the data sent between two or more parties on a network. It’s analogous to the role of protocol or etiquette in relationships between humans. For instance, suppose that you wanted to go out with friends to dinner or get married to someone. Each culture has defined conventions describing the legal and socially condoned behavior in such situations. When you go out for dinner, there are conventions about how to behave in a restaurant, how to use the eating utensils, and how to pay. Marriages are carried out according to conventions regarding rituals and contracts, conventions that can be very elaborate.

These two activities are very different, but the same lower-level social protocols underlie both of them. These protocols set standards for things such as politeness and the use of a mutually understood language. On the lowest level, you may be vibrating your vocal cords in a certain pattern, but on a higher level you’re finalizing your marriage by saying “I do.” Violate a lower-level protocol (say, by acting rudely in the restaurant) and your chances of carrying out your high-level goal can be compromised. All of these aspects of protocols for human behavior have their correspondence in protocols for computer networking.

Comparing Protocols and Programming Languages

Thousands of network protocols for every imaginable purpose have been invented over the past few decades; it might be said that the history of networking is the history of protocol design. Why so many protocols? To answer this question, consider another analogy to the world of network protocols: Why so many programming languages? Network protocols have the same types of interrelation as programming languages, and people create new protocols for the same reasons they create programming languages.

Different programming languages have been designed for different purposes. It would be madness to write a word processor in the FORTRAN language, not because FORTRAN is objectively “bad,” but because it was designed for mathematical and scientific research, not end-user GUI applications.

Similarly, different protocols are intended for different purposes. SMTP, the protocol you just got a brief look at, could be used for all sorts of things besides sending mail. No one does this because it makes more sense to use SMTP for the purpose for which it was designed, and use other protocols for other purposes.

A programming language may be created to compete with others in the same niche. The creator of a new language may see technical or aesthetic flaws in existing languages and want to make their own tasks easier. A language author may covet the riches and fame that come with being the creator of a popular language. A person may invent a new protocol because they’ve come up with a new type of application that requires one.

Some programming languages are designed specifically for teaching students how to program, or, at the other end of programming literacy, how to write compilers. Some languages are designed to explore new ideas, not for real use, and other languages are created as a competitive tool by one company for use against another company.


These factors also come into play in protocol design. Companies sometimes invent new, incompatible protocols to try to take business from a competitor. Some protocols are intended only for pedagogical purposes. For instance, this chapter will, under the guise of teaching network programming, design protocols for things like online chat rooms. There are already perfectly good protocols for this, but they’re too complex to be given a proper treatment in the available space.

The Ada programming language was defined by the U.S. Department of Defense to act as a common language across all military programming projects. The Internet Protocol was created to enable multiple previously incompatible networks to communicate with one another (hence the name “Internet”).

Nowadays, even internal networks (intranets) usually run atop the Internet Protocol, but the old motives (the solving of new problems, competition, and so on) remain in play at higher and lower levels, which brings us to the most interesting reason for the proliferation of programming languages and protocols.

The Internet Protocol Stack

Different programming languages operate at different levels of abstraction. Python is a very high-level language capable of all kinds of tasks, but the Python interpreter itself isn’t written in Python: It’s written in C, a lower-level language. C, in turn, is compiled into a machine language specific to your computer architecture. Whenever you type a statement into a Python interpreter, there is a chain of abstraction reaching down to the machine code, and even lower to the operation of the digital circuits that actually drive the computer.

There’s a Python interpreter written in Java (Jython), but Java is written in C. PyPy is a project that aims to implement a Python interpreter in Python, but PyPy runs on top of the C or Java implementation. You can’t escape C!

In one sense, when you type a statement into the Python interpreter, the computer simply “does what you told it to.” In another, it runs the Python statement you typed. In a third sense, it runs a longer series of C statements, written by the authors of Python and merely activated by your Python statement. In a fourth sense, the computer runs a very long, nearly incomprehensible series of machine code statements. In a fifth, it doesn’t “run” any program at all: You just cause a series of timed electrical impulses to be sent through the hardware. We have high-level programming languages because they’re easier to use than the lower-level ones. That doesn’t make lower-level languages superfluous, though.

English is a very high-level human language capable of all kinds of tasks, but one can’t speak English just by “speaking English.” To speak English, one must actually make some noises, but a speaker can’t just “make some noises” either: We have to send electrical impulses from our brains that force air out of the lungs and constantly reposition the tongues and lips. It’s a very complicated process, but we don’t even think about the lower levels, only the words we’re saying and the concepts we’re trying to convey.

The soup of network protocols can be grouped into a similar hierarchical structure based on levels of abstraction, or layers. On the physical layer, the lowest level, it’s all just electrical impulses and EM radiation. Just above the physical layer, every type of network hardware needs its own protocol, implemented in software (for instance, the Ethernet protocol for networks that run over LAN wires). The electromagnetic phenomena of the physical layer can now be seen as the sending and receiving of bits from one device to another. This is called the data link layer. As you go up the protocol stack, these raw bits take on meaning: They become routing instructions, commands, responses, images, web pages.


Because different pieces of hardware communicate in different ways, connecting (for example) an Ethernet network to a wireless network requires a protocol that works on a higher level than the data link layer. As mentioned earlier, the common denominator for most networks nowadays is the Internet Protocol (IP), which implements the network layer and connects all those networks together.

Directly atop the network layer is the transport layer, which makes sure the information sent over IP gets to its destination reliably, in the right order, and without errors. IP doesn’t care about reliability or error-checking: It just takes some data and a destination address, sends it across the network, and assumes it gets to that address intact.

TCP, the Transmission Control Protocol, does care about these things. TCP implements the transport layer of the protocol stack, making reliable, orderly communication possible between two points on the network. It’s so common to stack TCP on top of IP that the two protocols are often treated as one and given a unified name, TCP/IP.
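One way to picture the stack is as a series of envelopes, each layer wrapping whatever the layer above handed it. The following toy model is only an illustration of that nesting: the wrap and unwrap_all helpers and the dictionary fields are invented, and real TCP segments, IP datagrams, and Ethernet frames are binary structures with many more header fields.

```python
def wrap(layer, header, payload):
    """Return a toy 'packet': a layer name, that layer's header, and its payload."""
    return {"layer": layer, "header": header, "payload": payload}


def unwrap_all(packet):
    """Peel the layers off from the outside in; return their names and the data."""
    layers = []
    while isinstance(packet, dict):
        layers.append(packet["layer"])
        packet = packet["payload"]
    return layers, packet


# Application data, like the message from the SMTP session earlier.
message = "Subject: Hello\n\nThis is the body of the message."

# Each layer adds its own header around what the layer above produced.
segment = wrap("transport (TCP)", {"destination_port": 25}, message)
datagram = wrap("network (IP)", {"destination_address": "127.0.0.1"}, segment)
frame = wrap("data link (Ethernet)", {"note": "hardware addresses go here"}, datagram)

layers, data = unwrap_all(frame)
print(layers)
print(data == message)
```

The receiving machine removes the wrappers in the opposite order, which is why the application at the far end sees only the original message.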

All of the network protocols you’ll study and design in this chapter are based on top of TCP/IP. These protocols are at the application layer and are designed to solve specific user problems. Some of these protocols are known by name even to nonprogrammers: You may have heard of HTTP, FTP, BitTorrent, and so on.

When people think of designing protocols, they usually think of the application layer, the one best suited to Python implementations. The other current field of interest is at the other end in the data link layer: embedded systems programming for connecting new types of devices to the Internet. Thanks to the overwhelming popularity of the Internet, TCP/IP has more or less taken over the middle of the protocol stack.
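To make the idea of an application-layer protocol concrete, here is a deliberately tiny one, sketched under the assumption of current Python 3 and a working loopback interface. The protocol is invented for this example: the client sends one line of text, and the server answers with the same line in upper case. TCP, underneath, takes care of delivering the bytes reliably and in order.

```python
import socket
import threading


def serve_once(listener):
    """Accept one connection, read one line, and reply with it in upper case."""
    connection, _address = listener.accept()
    with connection:
        with connection.makefile("rwb") as stream:
            line = stream.readline()     # one request, terminated by a newline
            stream.write(line.upper())   # one response
            stream.flush()


def shout(host, port, text):
    """Client side of the toy protocol: send one line, return the reply."""
    with socket.create_connection((host, port), timeout=5) as connection:
        with connection.makefile("rwb") as stream:
            stream.write(text.encode("utf-8") + b"\n")
            stream.flush()
            return stream.readline().decode("utf-8").rstrip("\n")


# Bind to port 0 so the operating system picks a free port on this machine.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

# Run the server in a background thread so the client can talk to it.
worker = threading.Thread(target=serve_once, args=(listener,))
worker.start()
reply = shout("127.0.0.1", port, "hello over tcp")
worker.join()
listener.close()
print(reply)
```

Everything that makes this a protocol lives in two conventions shared by both sides: messages end with a newline, and each request gets exactly one response.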

A Little Bit About the Internet Protocol

Now that you understand where the Internet Protocol fits into the protocol stack your computer uses, there are only two things you really need to know about it: addresses and ports.

Internet Addresses

Each computer on the Internet (or on a private TCP/IP network) has one or more IP addresses, usually represented as a dotted series of four numbers, like “208.215.179.178.” That same computer may also have one or more hostnames, which look like “wrox.com.”

To connect to a service running on a computer, you need to know its IP address or one of its hostnames. (Hostnames are managed by DNS, a protocol that runs on top of TCP/IP and silently turns hostnames into IP addresses.) Recall the script at the beginning of this chapter that sent out mail. When it tried to connect to a mail server, it mentioned the seemingly magic string “localhost”:

>>> server = smtplib.SMTP("localhost", 25)

“localhost” is a special hostname that always refers to the computer you’re using when you mention it (each computer also has a special IP address that does the same thing: 127.0.0.1). The hostname is how you tell Python where on the Internet to find your mail server.


It’s generally better to use hostnames instead of IP addresses, even though the former immediately gets turned into the latter. Hostnames tend to be more stable over time than IP addresses. Another example of the protocol stack in action: The DNS protocol serves to hide the low-level details of IP’s addressing scheme.
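You can watch a hostname being turned into addresses by asking the socket module directly. This is a small sketch assuming current Python 3; the resolve function is a name made up for this example, and the exact addresses returned depend on how your machine is configured, although “localhost” typically maps to 127.0.0.1 and, where IPv6 is enabled, to ::1.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses that a hostname maps to."""
    results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({sockaddr[0] for _f, _t, _p, _c, sockaddr in results})


addresses = resolve("localhost")
print(addresses)
```

Calls such as smtplib.SMTP("localhost", 25) perform this lookup for you, which is why the session earlier never mentioned an IP address.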

Of course, if you don’t run a mail server on your computer, “localhost” won’t work. The organization that gives you Internet access should be letting you use their mail server, possibly located at mail.[your ISP].com or smtp.[your ISP].com. Whatever mail client you use, it probably has the hostname of a mail server somewhere in its configuration, so that you can use it to send out mail. Substitute that for “localhost” in the example code listed previously, and you should be able to send mail from Python:

>>> fromAddress = 'sender@example.com'
>>> toAddress = '[your email address]'
>>> msg = "Subject: Hello\n\nThis is the body of the message."
>>> import smtplib
>>> server = smtplib.SMTP("mail.[your ISP].com", 25)
>>> server.sendmail(fromAddress, toAddress, msg)
{}

Unfortunately, you still might not be able to send mail, for any number of reasons. Your SMTP server might demand authentication, which this sample session doesn’t provide. It might not accept mail from the machine on which you’re running your script (try the same machine you normally use to send mail). It might be running on a nonstandard port (see below). The server might not like the format of this bare-bones message, and expect something more like a “real” e-mail message; if so, the email module described in the following section might help. If all else fails, ask your system administrator for help.
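If your server demands authentication or listens on a different port, smtplib can supply both. The following is a hedged sketch, not a recipe for any particular provider: the host mail.example.com, the credentials, and port 587 (a port many providers use for mail submission) are placeholders, and send_authenticated and RecordingSMTP are names invented here. The stand-in class lets the example run without a real server; pass no smtp_factory argument to use smtplib.SMTP itself.

```python
import smtplib


def send_authenticated(host, port, username, password,
                       from_address, to_address, message,
                       smtp_factory=smtplib.SMTP):
    """Send one message through a server that requires TLS and a login."""
    server = smtp_factory(host, port)
    try:
        server.starttls()                 # upgrade the connection to TLS
        server.login(username, password)  # authenticate before sending
        return server.sendmail(from_address, to_address, message)
    finally:
        server.quit()


class RecordingSMTP:
    """Stand-in for smtplib.SMTP that records calls instead of using a network."""

    def __init__(self, host, port):
        self.calls = [("connect", host, port)]

    def starttls(self):
        self.calls.append(("starttls",))

    def login(self, username, password):
        self.calls.append(("login", username))

    def sendmail(self, from_address, to_address, message):
        self.calls.append(("sendmail", from_address, to_address))
        return {}

    def quit(self):
        self.calls.append(("quit",))


created = []


def factory(host, port):
    """Build a RecordingSMTP and remember it so the calls can be inspected."""
    server = RecordingSMTP(host, port)
    created.append(server)
    return server


refused = send_authenticated(
    "mail.example.com", 587, "user", "secret",
    "sender@example.com", "me@my.domain",
    "Subject: Hello\n\nThis is the body of the message.",
    smtp_factory=factory)
print(refused)
print([call[0] for call in created[0].calls])
```

Whether your server expects starttls at all, and on which port, is something only its administrator or documentation can tell you.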

Internet Ports

The string “localhost” has been explained as a DNS hostname that masks an IP address. That leaves the mysterious number 25. What does it mean? Well, consider the fact that a single computer may host more than one service. A single machine with one IP address may have a web server, a mail server, a database server, and a dozen other servers. How should clients distinguish between an attempt to connect to the web server and an attempt to connect to the database server?

A computer that implements the Internet Protocol can expose up to 65536 numbered ports (numbered 0 through 65535). When you start an Internet server (say, a web server), the server process “binds” itself to one or more of the ports on your computer (say, port 80, the conventional port for a web server) and begins listening for outside connections to that port. If you’ve ever seen a web site address that looked like “http://www.example.com:8000/”, that number is the port number for the web server (in this case, a port number that violates convention). The enforcer of convention in this case is the Internet Assigned Numbers Authority.

The IANA list of protocols and conventional port numbers is published at www.iana.org/assignments/port-numbers.

According to the IANA, the conventional port number for SMTP is 25. That’s why the constructor to the SMTP object in that example received 25 as its second argument (if you don’t specify a port number at all, the SMTP constructor will assume 25):

>>> server = smtplib.SMTP("localhost", 25)
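A few lines of current Python 3 can make these port conventions visible. This is only an illustrative sketch: it reads the default SMTP port that smtplib stores as a module constant, pulls the port number out of a URL with urllib.parse, and binds a socket to port 0, which asks the operating system to choose any free port.

```python
import smtplib
import socket
from urllib.parse import urlsplit

# smtplib records the conventional SMTP port as a module-level constant.
print(smtplib.SMTP_PORT)

# A URL can name a port explicitly. When it does not, .port is None and the
# client falls back to the conventional port for the scheme.
explicit = urlsplit("http://www.example.com:8000/").port
implicit = urlsplit("http://www.example.com/").port
print(explicit, implicit)

# A server "binds" to a port and listens. Port 0 means "any free port".
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    bound_port = listener.getsockname()[1]
    print(bound_port)
```

The bound port number changes from run to run, which is exactly why well-known services agree on fixed numbers such as 25 and 80 instead.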
