Yes, the characters were all from the 0-127 ASCII block but encoded as UTF-16, so each character is paired with a null byte. I.e., \x00?\x00x\x00m\x00l\x00
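To make that byte layout concrete, here is a minimal sketch. The sample text is a hypothetical stand-in for the start of the real document, and the little-endian byte order is an assumption based on the \xff\xfe byte order mark seen in the data; it is not taken from the actual file.

```python
import codecs

# Hypothetical sample: only the first few characters of an XML declaration.
text = u'<?xml version="1.0"'

# "utf-16-le" fixes the byte order and writes no BOM, so the BOM is
# prepended explicitly. Little-endian is assumed here.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

body = data[2:]          # skip the two BOM bytes
low_bytes = body[0::2]   # first byte of each 16-bit unit: the ASCII code
high_bytes = body[1::2]  # second byte of each unit: zero for ASCII text

# Every high byte is a null byte, and the low bytes alone spell the text.
print(high_bytes == b"\x00" * len(text))
print(low_bytes.decode("ascii") == text)

# Decoding with the generic "utf-16" codec consumes the BOM again.
print(data.decode("utf-16") == text)
```

Splitting the payload into even and odd offsets is just a quick way to confirm that nothing outside the 7-bit range is present; it says nothing about how a parser will treat the encoding declaration.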
Here is something weird I found while experimenting with ElementTree on this same XML string. Consider the same XML in two forms: as a Python unicode string, and as a str containing the UTF-16 encoded bytes. That is,

    u'<?xml version="1.0" encoding="UTF-16" st' ...

or

    '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00'...

So if these are x and y, then

    y = x.encode("utf-16")

The actual bytes would be the same, I think, although y is type str and x is type unicode.

The xml.sax.parseString documentation says that it parses from a buffer string received as a parameter, so one might imagine that either x or y would be acceptable, and that the bytes would be interpreted according to the encoding declaration in the byte stream. And, in fact, both do work with xml.sax.parseString (at least for me). With etree.parse(StringIO.StringIO...), though, only the str form works.

Regards,
Jon Peck

-----Original Message-----
From: Jeroen Ruigrok van der Werven [mailto:[EMAIL PROTECTED]
Sent: Saturday, February 02, 2008 12:57 AM
To: JKPeck
Cc: python-list@python.org
Subject: Re: Mysterious xml.sax Encoding Exception

-On [20080201 19:06], JKPeck ([EMAIL PROTECTED]) wrote:
>In both of these cases, there are only plain, 7-bit ascii characters
>in the xml, and it really is valid utf-16 as far as I can tell.

Did you mean to say that the only characters they used in the UTF-16 encoded
file are characters from the Basic Latin Unicode block?

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
We have met the enemy and they are ours...