Richard Lewis wrote:

> OK, I'm still not getting this unicode business.

obviously.

> <document>
> <a>aàáâã</a>
> <e>eèéêë</e>
> <i>iìíîï</i>
> <o>oòóôõ</o>
> <u>oùúûü</u>
> </document>
>
> (If testing, make sure you save this as utf-8 encoded.)

why?  that XML snippet doesn't include any UTF-8-encoded characters.

:::

> file = codecs.open(sys.argv[1], "r", "utf-8")
> document = parse(file)
> file.close()

why do you insist on decoding the stream you pass to the XML parser,
when you've already been told that you shouldn't do that?  change
this to:

    document = parse(sys.argv[1])

> print document.toxml(encoding="utf-8")

this converts the document to UTF-8, and prints it to stdout.  if you
get gibberish, your stdout wants some other encoding.  if you get
"capital-A-with-tilde" gibberish, your stdout expects ISO-8859-1.

try changing this to:

    print document.toxml(encoding=sys.stdout.encoding)

> out_str = unicode2charrefs(document.toxml(encoding="utf-8"))

this converts the document to UTF-8, and then translates the *encoded*
data to character references as if the document had been encoded as
ISO-8859-1.  this makes no sense at all, and results in an XML document
full of "capital-A-with-tilde" gibberish.

> i.e., does anyone else get two byte sequences beginning with
> capital-A-with-tilde instead of the expected characters?

since you've requested UTF-8 output, "capital A with tilde" is the
expected result if you're directing output to an ISO-8859-1 stream.

> the output file is still wrong.

well, you're messing it up all by yourself.  getting rid of all the
codecs and unicode2charrefs nonsense will fix this:

    document = parse(sys.argv[1]) # parser decodes

    ... manipulate document ...

    file = open(..., "w")
    file.write(document.toxml(encoding="utf-8")) # writer encodes
    file.close()

</F>
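
for anyone who wants to see the "parser decodes, writer encodes" rule and the "capital-A-with-tilde" effect in isolation, here is a minimal sketch. it assumes Python 3 (the code above is Python 2), uses an in-memory sample document instead of sys.argv[1], and the variable names (source, text, encoded, garbled) are made up for illustration.

```python
import io
from xml.dom.minidom import parse

# hypothetical sample document, stored as UTF-8 encoded bytes
# (\u00e0..\u00e3 are a-grave, a-acute, a-circumflex, a-tilde)
source = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<document><a>a\u00e0\u00e1\u00e2\u00e3</a></document>"
).encode("utf-8")

# parser decodes: pass it raw bytes (a file name or a binary stream),
# not text that has already been run through a codec
document = parse(io.BytesIO(source))
text = document.getElementsByTagName("a")[0].firstChild.data

# writer encodes: with an encoding argument, toxml() returns bytes
encoded = document.toxml(encoding="utf-8")

# the gibberish: the same UTF-8 bytes read as if they were ISO-8859-1.
# each accented letter above is two bytes in UTF-8, and the first byte
# (0xC3) shows up as capital A with tilde (U+00C3)
garbled = encoded.decode("iso-8859-1")

print(repr(text))
print(repr(garbled))
```

note that under Python 3 the bytes returned by toxml(encoding=...) would have to be written to a file opened in binary mode ("wb"); that detail is an addition here, not part of the Python 2 code above.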