On May 25, 12:03 pm, "sim.sim" <[EMAIL PROTECTED]> wrote:
> On 25 May, 12:45, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> > In <[EMAIL PROTECTED]>, sim.sim wrote:
> > > Below is the code that tries to parse a well-formed xml, but it fails
> > > with the error message:
> > > "not well-formed (invalid token): line 3, column 85"
> >
> > How did you verify that it is well formed? `xmllint` barfs on it too.
>
> you can try to write iMessage to a file and open it using Mozilla
> Firefox (web-browser)
>
> > > The "problem" is within the CDATA-section: it contains part of a utf-8
> > > encoded string which was split (widely used for memory limited
> > > devices).
> > >
> > > When minidom parses the xml-string, it fails because it tries to convert
> > > the data within the CDATA-section into unicode, instead of just returning
> > > the value of the section "as is". The conversion contradicts the
> > > specification http://www.w3.org/TR/REC-xml/#sec-cdata-sect
> >
> > An XML document contains unicode characters, so does the CDATA section.
> > CDATA is not meant to put arbitrary bytes into a document. It must
> > contain valid characters of this type
> > http://www.w3.org/TR/REC-xml/#NT-Char (linked from the grammar of CDATA in
> > your link above).
> >
> > Ciao,
> > Marc 'BlackJack' Rintsch
>
> my CDATA-section contains only symbols in the range specified for
> Char:
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]
>
> filter(lambda x: ord(x) not in range(0x20, 0xD7FF), iMessage)

You need to explicitly convert the string of UTF-8 encoded bytes to a
Unicode string before parsing, e.g.

    unicodestring = unicode(encodedbytes, 'utf8')

Unless I messed up copying and pasting, your original string had an
erroneous byte immediately before ]]>. With that corrected I was able to
process the string correctly: the CDATA marked section consists entirely
of spaces and Cyrillic characters.

As I noted earlier, you will lose \r characters as part of the basic XML
processing.

HTH

Harvey
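
For anyone reading this on Python 3, where the `unicode()` builtin no longer exists, the same decode-then-parse step might look roughly like the sketch below. The helper name `parse_utf8_xml` and the sample message are made up for illustration and are not the original poster's data; the broken variant only imitates a multi-byte character that was cut in half just before `]]>`.

```python
from xml.dom import minidom


def parse_utf8_xml(encoded_bytes):
    """Decode UTF-8 bytes to text first, then hand the text to minidom.

    Hypothetical helper, not part of the original thread.
    """
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8,
    # for example when a multi-byte character was split in two.
    text = encoded_bytes.decode("utf-8")
    return minidom.parseString(text)


# Made-up message: a CDATA section holding spaces and Cyrillic characters.
good = "<msg><![CDATA[привет мир]]></msg>".encode("utf-8")
doc = parse_utf8_xml(good)
cdata_text = doc.documentElement.firstChild.data

# Imitate a split character: keep only the first byte of the last
# Cyrillic letter, so a stray lead byte sits right before "]]>".
last_letter = "р".encode("utf-8")
broken = good.replace(last_letter + b"]]>", last_letter[:1] + b"]]>")

try:
    parse_utf8_xml(broken)
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    bad_offset = exc.start  # byte offset of the offending byte
```

Decoding explicitly has the side benefit that a damaged byte sequence is reported as a `UnicodeDecodeError` with a byte offset, rather than as a generic "not well-formed (invalid token)" message from the XML parser.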