Re: Is this a bug? BOM decoded with UTF8

Brian Quinlan Fri, 11 Feb 2005 07:09:38 -0800

Diez B. Roggisch wrote:

I'm well aware of the need of a bom for fixed-size multibyte-characters like
utf16.

But I don't see the need for that on a utf-8 byte sequence, and I first
encountered that in MS tool output - can't remember when and what exactly
that was. And I have to confess that I attributed that as a stupidity from
MS. But according to the FAQ you mentioned, it is apparently legal in utf-8
too. Nevertheless the FAQ states:

[snipped]

So they admit that it makes no sense - especially as decoding a utf-8 string
given any 8-bit encoding like latin1 will succeed.

They say that it makes no sense as a byte-order indicator, but they indicate that it can be used as a file signature.

And I'm not sure what you mean about decoding a UTF-8 string given any 8-bit encoding. Of course the encoding must be known:

>>> u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'.encode('utf-8').decode('latin1').encode('latin1')
'T\xc3\xbcr'

I can assure you that most Germans can differentiate between "Tür" and "TÃ¼r".
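
To spell out the point: decoding UTF-8 bytes as latin1 never raises an error, it just gives you the wrong characters. Here is a minimal sketch, written for a current Python 3 interpreter rather than the Python 2 session above; the variable names are only for illustration.

```python
# UTF-8 bytes decode "successfully" as Latin-1, but the text comes out wrong.
text = "T\N{LATIN SMALL LETTER U WITH DIAERESIS}r"  # the German word for "door"
utf8_bytes = text.encode("utf-8")                   # b'T\xc3\xbcr' (4 bytes)

# Latin-1 maps every byte value to a character, so this cannot fail ...
as_latin1 = utf8_bytes.decode("latin-1")
# ... while decoding with the correct encoding restores the original text.
as_utf8 = utf8_bytes.decode("utf-8")

print(len(text), len(as_latin1))  # 3 characters versus 4 characters
print(as_utf8 == text)            # True
print(as_latin1 == text)          # False: the u-umlaut became two characters
```

Re-encoding the wrong text as latin1 gives back the original bytes, which is why the session above prints 'T\xc3\xbcr'.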

Using a BOM with UTF-8 makes it easy to identify it as such AND it shouldn't break any properly written Unicode-aware tools.
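
As a rough sketch of the "file signature" idea, assuming a current Python 3 where the 'utf-8-sig' codec is available: the helper names sniff_utf8_bom and decode_text are made up for this example, and the latin1 fallback is just an assumption about what the BOM-less input might be.

```python
import codecs


def sniff_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def decode_text(data, fallback="latin-1"):
    """Decode bytes, using the UTF-8 BOM as a signature when it is present.

    The fallback encoding is only a guess for input without a signature.
    """
    if sniff_utf8_bom(data):
        # 'utf-8-sig' decodes UTF-8 and drops the leading BOM.
        return data.decode("utf-8-sig")
    return data.decode(fallback)


signed = codecs.BOM_UTF8 + "T\u00fcr".encode("utf-8")
unsigned = "T\u00fcr".encode("latin-1")

print(decode_text(signed) == "T\u00fcr")    # True: recognised via the signature
print(decode_text(unsigned) == "T\u00fcr")  # True: fell back to latin1
```

Without the signature, the tool has to be told the encoding or guess it, which is the situation the latin1 example describes.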

Cheers,
Brian
--
http://mail.python.org/mailman/listinfo/python-list
