Peck, Jon wrote:
> Yes, the characters were from the 0-127 ascii block but encoded as utf-16, so
> there is a null byte with each nonzero character. I.e.,
> \x00?\x00x\x00m\x00l\x00
>
> Here is something weird I found while experimenting with ElementTree with
> this same XML string.
>
> Consider the same XML as a Python Unicode string, so it is actually encoded
> as utf-16 and as a string containing utf-16 bytes. That is
> u'<?xml version="1.0" encoding="UTF-16" st' ...
> or
> '\xff\xfe<\x00?\x00x\x00m\x00l\x00
> \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00'...
>
> So if these are x and y
> y = x.encode("utf-16")
>
> The actual bytes would be the same, I think, although y is type str and x is
> type unicode.

No. The internal representation of unicode characters is platform dependent,
and is either 2 or 4 bytes per character. If you want UTF-16, use ".encode()".

> xml.sax.parseString documentation says
>
> parses from a buffer string received as a parameter,
>
> so one might imagine that either x or y would be acceptable, and the bytes
> would be interpreted according to the encoding declaration in the byte stream.
>
> And, in fact, both do work with xml.sax.parseString (at least for me). With
> etree.parse(StringIO.StringIO...) though, only the str form works.

Don't try. Serialised XML is bytes, not characters.

Stefan
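
P.S. To illustrate the bytes-versus-characters point, here is a minimal sketch that is not taken from the thread. It assumes Python 3 naming (str for characters, bytes for the serialised form, io.BytesIO in place of StringIO.StringIO) and the standard library's xml.etree.ElementTree. The sample document and its element names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are invented for this sketch.
text = '<?xml version="1.0" encoding="UTF-16"?><root><item>abc</item></root>'

# Encoding explicitly produces the serialised form: a byte order mark
# followed by two bytes for each of these ASCII-range characters.
data = text.encode("utf-16")

# The parser is handed bytes and works out the encoding from the byte
# order mark and the XML declaration contained in those bytes.
root = ET.parse(io.BytesIO(data)).getroot()
item_text = root.find("item").text
```

Encoding explicitly keeps the byte layout independent of however the interpreter happens to store characters internally.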