Hi there,

I have two files, "my.utf8" and "my.utf16", which
both contain a BOM and two "a" characters.

Contents of "my.utf8" in HEX:
        EFBBBF6161

Contents of "my.utf16" in HEX:
        FEFF6161


For some reason Python 2.4 decodes the BOM for UTF-8 but not for UTF-16. See below:

>>> import codecs
>>> fh = codecs.open("my.utf8", "rb", "utf8")
>>> fh.readlines()
[u'\ufeffaa']   # BOM is decoded, why?
>>> fh.close()
>>> fh = codecs.open("my.utf16", "rb", "utf16")
>>> fh.readlines()
[u'\u6161']     # No BOM here
>>> fh.close()

Is there a trick to read a UTF-8 encoded file so that the BOM does not show up in the decoded text?
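
To make clear what I am after, here is a minimal sketch of the behaviour I
would like, done by hand. It assumes the whole content is already in memory
as a byte string; decode_utf8_without_bom is just a name made up for this
illustration, not something from the standard library.

```python
import codecs

def decode_utf8_without_bom(raw):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")

# Same bytes as in "my.utf8": EF BB BF 61 61
sample = codecs.BOM_UTF8 + "aa".encode("ascii")
text = decode_utf8_without_bom(sample)
```

With the sample bytes above, text should hold just the two "a" characters.
As far as I know, Python 2.5 and later also ship a "utf-8-sig" codec that
skips the BOM while decoding, but it is not available in 2.4.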

-pekka-