Re: Mysterious xml.sax Encoding Exception

Stefan Behnel Sat, 02 Feb 2008 08:22:44 -0800

Peck, Jon wrote:
> Yes, the characters were from the 0-127 ascii block but encoded as utf-16, so 
> there is a null byte with each nonzero character.  I.e., 
> \x00?\x00x\x00m\x00l\x00
> 
> Here is something weird I found while experimenting with ElementTree with 
> this same XML string.
> 
> Consider the same XML in two forms: as a Python unicode string (which I 
> take to be encoded as utf-16) and as a str containing utf-16 bytes.  That is
> u'<?xml version="1.0" encoding="UTF-16" st' ...
> or
> '\xff\xfe<\x00?\x00x\x00m\x00l\x00 
> \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00'...
> 
> So if these are x and y
> y = x.encode("utf-16")
> 
> The actual bytes would be the same, I think, although y is type str and x is 
> type unicode.


No. The internal representation of unicode characters is platform dependent:
it is either 2 or 4 bytes per character, depending on how Python was built.
If you want UTF-16 bytes, you have to call x.encode("utf-16") explicitly.
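
To make the difference concrete, here is a minimal sketch. It assumes a
current Python 3, where the unicode type is called str and byte strings are
bytes; the variable names x and y follow the quoted message, and the short
XML document is made up for illustration.

```python
# x is text: a sequence of characters with no byte layout you can rely on.
x = '<?xml version="1.0" encoding="UTF-16"?><doc/>'

# y is the UTF-16 serialisation of that text: real bytes, BOM included.
y = x.encode("utf-16")

# The "utf-16" codec prepends a byte order mark in the machine's byte order.
has_bom = y[:2] in (b"\xff\xfe", b"\xfe\xff")

# Two bytes per character here (all characters are in the BMP), plus the BOM.
expected_length = 2 * len(x) + 2

print(type(x).__name__, type(y).__name__)
print(has_bom, len(y) == expected_length)
print(y.decode("utf-16") == x)
```

The two objects only become comparable after an explicit encode or decode
step; nothing about how x is stored in memory is visible from Python code.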


> xml.sax.parseString documentation says
> 
> parses from a buffer string received as a parameter, 
> 
> so one might imagine that either x or y would be acceptable, and the bytes 
> would be interpreted according to the encoding declaration in the byte stream.
> 
> And, in fact, both do work with xml.sax.parseString (at least for me).  With 
> etree.parse(StringIO.StringIO...) though, only the str form works.

Don't try to parse a unicode string. Serialised XML is bytes, not characters,
so hand the parser the encoded byte string.
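
As a rough sketch of what that looks like, assuming a current Python 3
(bytes instead of the old str, io.BytesIO instead of StringIO.StringIO) and
the standard library's xml.etree.ElementTree: the sample document and the
TagCollector handler are hypothetical names used only for this illustration.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET

# Text form of a small sample document, then its UTF-16 serialisation.
text = '<?xml version="1.0" encoding="UTF-16"?><doc><item>hi</item></doc>'
data = text.encode("utf-16")  # bytes, starting with a BOM


class TagCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


# SAX: the parser reads the bytes and honours the BOM / encoding declaration.
handler = TagCollector()
xml.sax.parseString(data, handler)
sax_tags = handler.tags

# ElementTree: wrap the bytes in a binary file-like object.
root = ET.parse(io.BytesIO(data)).getroot()
etree_tags = [el.tag for el in root.iter()]

print(sax_tags)
print(etree_tags)
```

In both cases the parser does the decoding itself, so the encoding
declaration in the document and the actual bytes cannot disagree.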

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list
