OK, I'm still not getting this unicode business. Given this document:

==========================
<?xml version="1.0" encoding="utf-8" ?>
<document>
<a>aàáâã</a>
<e>eèéêë</e>
<i>iìíîï</i>
<o>oòóôõ</o>
<u>oùúûü</u>
</document>
==========================

(If testing, make sure you save this as utf-8 encoded.)

and this Python script:

==========================
import sys
from xml.dom.minidom import *
from xml.dom import *
import codecs
import string

CHARACTERS = range(128,255)

def unicode2charrefs(s):
    "Returns a unicode string with all the non-ascii characters from the given unicode string converted to character references."
    result = u""
    for c in s:
        code = ord(c)
        if code in CHARACTERS:
            result += u"&#" + string.zfill(str(code), 3).decode('utf-8') + u";"
        else:
            result += c.encode('utf-8')
    return result

def main():
    print "Parsing file..."
    file = codecs.open(sys.argv[1], "r", "utf-8")
    document = parse(file)
    file.close()
    print "done."
    print document.toxml(encoding="utf-8")
    out_str = unicode2charrefs(document.toxml(encoding="utf-8"))
    print "Writing to '" + sys.argv[1] + "~' ..."
    file = codecs.open(sys.argv[1] + "~", "w", "utf-8")
    file.write(out_str)
    file.close()
    print "done."

if __name__ == "__main__":
    main()
==========================

Does anyone else get this output from the "print document.toxml(encoding="utf-8")" line:

<document>
<a>aà áâã</a>
<e>eèéêë</e>
<i>iìÃîï</i>
<o>oòóôõ</o>
<u>oùúûü</u>
</document>

and, similarly, this output document:

==========================
<?xml version="1.0" encoding="utf-8"?>
<document>
<a>aàáâã</a>
<e>eèéêë</e>
<i>iìíîï</i>
<o>oòóôõ</o>
<u>oùúûü</u>
</document>
==========================

i.e., does anyone else get two-byte sequences beginning with capital-A-with-tilde instead of the expected characters?

I'm using the Kate editor from KDE and the Konsole terminal (running bash) on Linux (2.6 kernel). Does that make any difference? I've just tried it in the unicode-aware xterm, and there the "print document.toxml(encoding="utf-8")" line produces the expected output, but the output file is still wrong.

Any ideas what's wrong?
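The symptom can be reproduced in isolation with a minimal sketch. Two things here are assumptions rather than anything established by the output above: the sketch is written for Python 3 (the script itself is Python 2), and it supposes the stray characters appear because UTF-8 bytes are being handled one byte at a time, as if each byte were a Latin-1 character. The helper name to_charrefs is made up for this sketch and is not the script's unicode2charrefs.

```python
# Sketch only, Python 3 (the script in this post is Python 2).
# Assumption: the mojibake comes from treating UTF-8 bytes as single characters.

text = "aàáâã"  # the content of the <a> element, as decoded text

# Encoding to UTF-8 turns each accented letter into two bytes (0xC3 + one more).
utf8_bytes = text.encode("utf-8")

# Reading those bytes back one byte per character (Latin-1) gives the
# capital-A-with-tilde pairs.
mojibake = utf8_bytes.decode("latin-1")


def to_charrefs(s):
    """Replace every non-ASCII character of a text string with a numeric character reference."""
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in s)


refs_from_text = to_charrefs(text)       # one reference per accented letter
refs_from_bytes = to_charrefs(mojibake)  # two references per accented letter

print(utf8_bytes)
print(refs_from_text)
print(refs_from_bytes)
```

If that assumption holds, the references built from the decoded text have one entry per accented letter, while those built from the encoded bytes have two, each pair starting with 195 (capital A with tilde).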
Cheers,
Richard