On Feb 2, 8:12 am, JKPeck <[EMAIL PROTECTED]> wrote:
> On Feb 1, 1:51 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> > > They sent me the actual file, which was created on Windows, as an
> > > email attachment. They had also sent the actual dataset from which
> > > the XML was generated so that I could generate it myself using the
> > > same version of our app as the user has. I did that but did not get
> > > an exception.
>
> > So are you sure you open the file in binary mode on Windows?
>
> > Regards,
> > Martin
>
> In the real case, the xml never goes through a file but is handed
> directly to the parser. The api return a Python Unicode string
> (utf-16).

A Python unicode object is *NOT* the UTF-16 that the SAX parser is
expecting. It is expecting a str object which is Unicode text encoded
as UTF-16.

>>> unicode_obj = u'abcde'
>>> str_obj = unicode_obj.encode('UTF-16')
>>> print repr(unicode_obj)
u'abcde'
>>> print repr(str_obj)
'\xff\xfea\x00b\x00c\x00d\x00e\x00'
>>>

At the end of this post is code using a str object (works) then
attempting to use a unicode object (reproduces your error message).

> For the file the user sent, if I open it in binary mode, it
> still has a BOM; otherwise the BOM is removed. But either version
> works on my system.
>
> The basic fact, though, remains, the same code works for me with the
> same input but not for two particular users (out of hundreds).

If the real case doesn't involve a file, I can't imagine what you can
infer from a file that isn't used [strike 1] sent to you by a user
[strike 2]. Consider trapping the exception, writing
repr(the_xml_document_string[:80]) to the log file, and re-raising the
exception. Get the user to run the app. You inspect the log file.

Here's the promised code and results.

C:\junk>type utf16sax.py
import xml.sax, xml.sax.saxutils
import cStringIO
asciistr = 'qwertyuiop'
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
unicode_doc = (xml_template % ('UTF-16', asciistr)).decode('ascii')
utf16_doc = unicode_doc.encode('UTF-16')
for doc in (utf16_doc, unicode_doc):
    print
    print 'doc = ', repr(doc)
    print
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    start = result.find('<data>') + 6
    end = result.find('</data>')
    mydata = result[start:end]
    print "SAX output (UTF-8): %r" % mydata

C:\junk>utf16sax.py

doc =  '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00
"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00
"\x00U\x00T\x00F\x00-\x001\x006\x00"\x00?\x00>\x00<\x00d\x00a\x00t\x00a\x00>\x00
q\x00w\x00e\x00r\x00t\x00y\x00u\x00i\x00o\x00p\x00<\x00/\x00d\x00a\x00t\x00a\x00>\x00'

SAX output (UTF-8): 'qwertyuiop'

doc =  u'<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'

Traceback (most recent call last):
  File "C:\junk\utf16sax.py", line 13, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: encoding
specified in XML declaration is incorrect

I guess what is happening is that the unicode object is coerced to str
using the default encoding (ascii); the parser then looks at the
result, parses out the "UTF-16", attempts to decode the document as
UTF-16, fails, and complains.
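
Putting the two suggestions together (encode the unicode object before
handing it to the parser, and log the start of the document if parsing
still fails), a minimal sketch might look like the following. The names
parse_xml_text and TextCollector and the logger name "xml_debug" are
made up for illustration, and it assumes the API really does hand back
a unicode object whose XML declaration says UTF-16. It sticks to syntax
that should behave the same on old and new Pythons, so treat it as a
starting point rather than a drop-in fix.

```python
import logging
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical logger name, chosen only for this sketch.
log = logging.getLogger("xml_debug")


class TextCollector(ContentHandler):
    """Illustrative handler: collects all character data the parser reports."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def text(self):
        return u"".join(self.chunks)


def parse_xml_text(xml_text, handler, encoding="UTF-16"):
    """Parse xml_text, encoding it first if it is a text (unicode) object.

    Assumption: the encoding named here matches the one in the document's
    XML declaration. Encoded byte strings are passed through unchanged.
    """
    if isinstance(xml_text, type(u"")):
        payload = xml_text.encode(encoding)
    else:
        payload = xml_text
    try:
        xml.sax.parseString(payload, handler)
    except xml.sax.SAXParseException:
        # Record what was actually received, then let the error propagate.
        log.error("XML parse failed; document starts with %r", xml_text[:80])
        raise


doc = u'<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'
collector = TextCollector()
parse_xml_text(doc, collector)
print(repr(collector.text()))
```

If the two affected users still get the exception with something like
this in place, the repr in the log should show exactly what the API
delivered on their machines.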

HTH,
John