> They say that it makes no sense as a byte-order indicator but they
> indicate that it can be used as a file signature.
>
> And I'm not sure what you mean about decoding a UTF-8 string given any
> 8-bit encoding. Of course the encoder must be known:
That every UTF-8 string can be decoded in any 8-bit encoding that assigns a
character to every byte value (latin1, for example). Does the result make
sense? No. But does it fail (as decoding UTF-8 frequently does)? No.

So if you are in a situation where you _don't_ know the encoding, decoding
can only be based on a heuristic. A UTF-8 BOM can be part of that
heuristic, but it is still only a hint. Besides that, lots of tools don't
produce it. E.g. everything that produces or consumes XML doesn't need it.

> >>> u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'
> ... .encode('utf-8').decode('latin1').encode('latin1')
> 'T\xc3\xbcr'

If the encoding has to be known anyway, the BOM becomes obsolete.

> I can assure you that most Germans can differentiate between "Tür" and
> "Tã¼r".

Oh, Germans can. Computers, OTOH, can't. You could try to use common words
like "für" and so on for a heuristic. But that is no guarantee.

> Using a BOM with UTF-8 makes it easy to identify it as such AND it
> shouldn't break any properly written Unicode-aware tools.

As the FAQ states, that can very well happen.

--
Regards,

Diez B. Roggisch
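
P.S. A minimal Python 3 sketch of the point above, not a definitive
implementation. It shows that decoding UTF-8 bytes as latin1 never fails
(it just produces nonsense), that decoding latin1 bytes as UTF-8 can fail,
and how a BOM can serve as one hint in a heuristic. The function name
guess_decode and its fallback parameter are made up for this illustration;
only codecs.BOM_UTF8 and the 'utf-8-sig' codec come from the standard
library.

```python
import codecs


def guess_decode(data, fallback="latin-1"):
    """Guess an encoding for `data` (hypothetical helper, illustration only).

    Returns a (text, encoding_name) tuple. The BOM is treated as a hint,
    not as proof: if the bytes after it are not valid UTF-8, we still
    fall through to the fallback encoding.
    """
    # 'utf-8-sig' strips a leading UTF-8 BOM while decoding.
    first = "utf-8-sig" if data.startswith(codecs.BOM_UTF8) else "utf-8"
    for encoding in (first, fallback):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


utf8_bytes = "T\u00fcr".encode("utf-8")      # b'T\xc3\xbcr'
latin1_bytes = "T\u00fcr".encode("latin-1")  # b'T\xfcr'

# Decoding UTF-8 bytes as latin1 "works", but yields two wrong characters.
mojibake = utf8_bytes.decode("latin-1")

# Decoding latin1 bytes as UTF-8 fails: 0xfc cannot start a UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(repr(mojibake), utf8_failed)
print(guess_decode(utf8_bytes))
print(guess_decode(codecs.BOM_UTF8 + utf8_bytes))
print(guess_decode(latin1_bytes))
```

Note that the fallback answer is still only a guess: bytes that happen to
be valid UTF-8 will be reported as UTF-8 even if they were written in some
other 8-bit encoding.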