Re: Is this a bug? BOM decoded with UTF8

pekka niiranen wrote:

I have two files, "my.utf8" and "my.utf16", which
both contain a BOM and two "a" characters.

Contents of "my.utf8" in HEX:
    EFBBBF6161

Contents of "my.utf16" in HEX:
    FEFF6161

This is not true: this byte string does not denote
two "a" characters. Instead, it is a BOM followed by
the single character U+6161.
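
A quick way to check this in a current Python 3 interpreter (a minimal sketch; the variable names are only for illustration, and the byte strings are the ones quoted above):

```python
# Bytes from "my.utf16" as posted: BOM (FE FF) followed by 61 61.
data = bytes.fromhex("FEFF6161")

# The generic "utf-16" codec uses the BOM to pick the byte order
# (big-endian here) and does not return it as part of the text.
text = data.decode("utf-16")

print(len(text))          # 1
print(hex(ord(text[0])))  # 0x6161

# Two "a" characters in big-endian UTF-16 need two bytes each.
two_a = b"\xfe\xff" + "aa".encode("utf-16-be")
print(two_a.hex().upper())  # FEFF00610061
```

So a file that really holds a BOM and two "a" characters in UTF-16 would be FEFF00610061 (or FFFE61006100 in little-endian order).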

Correct, I used a hex editor to create those files.

Is there a trick to read a UTF-8 encoded file so that the BOM is not decoded?
It's very easy: just drop the first character if it is the BOM.
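
Roughly like this, written for Python 3 (a sketch only; strip_bom is a made-up helper name, not something from the standard library):

```python
def strip_bom(text):
    """Return text without a leading U+FEFF, if there is one."""
    if text.startswith("\ufeff"):
        return text[1:]
    return text

# Bytes from "my.utf8" as posted: EF BB BF is the UTF-8 encoded BOM.
data = bytes.fromhex("EFBBBF6161")

decoded = data.decode("utf-8")   # the BOM survives as U+FEFF
print(repr(decoded))             # '\ufeffaa'
print(repr(strip_bom(decoded)))  # 'aa'
```

This only looks at the first character. Replacing every U+FEFF in the string would also remove any that appear later in the text, where the character means ZERO WIDTH NO-BREAK SPACE rather than a BOM.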


I know it's easy (string.replace()), but why does UTF-16 do
it on its own then? Is that according to the Unicode standard or just
a Python convention?

The UTF-8 codec will never do this on its own.
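
For comparison, Python 2.5 and later also ship a separate codec named "utf-8-sig" that skips a leading BOM when decoding and writes one when encoding; the plain "utf-8" codec is unchanged. A small sketch for Python 3, using the bytes from "my.utf8" above:

```python
data = bytes.fromhex("EFBBBF6161")

plain = data.decode("utf-8")      # BOM kept as U+FEFF
sig = data.decode("utf-8-sig")    # leading BOM skipped

print(repr(plain))  # '\ufeffaa'
print(repr(sig))    # 'aa'

# Encoding with "utf-8-sig" puts the BOM back in front.
print("aa".encode("utf-8-sig") == data)  # True
```

Input without a BOM decodes the same way under both codecs, so "utf-8-sig" can be used when it is not known whether the file starts with one.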

Never? Hmm, so that is not going to change in future versions?

Regards,
Martin

--
http://mail.python.org/mailman/listinfo/python-list
