[snipped] I'm well aware of the need for a BOM for fixed-size multibyte characters like UTF-16.
But I don't see the need for that on a UTF-8 byte sequence, and I first encountered that in MS tool output - can't remember when and what exactly that was. And I have to confess that I attributed that to stupidity on MS's part. But according to the FAQ you mentioned, it is apparently legal in UTF-8 too. Nevertheless the FAQ states:
So they admit that it makes no sense - especially as decoding a utf-8 string given any 8-bit encoding like latin1 will succeed.
They say that it makes no sense as a byte-order indicator, but they indicate that it can be used as a file signature.
And I'm not sure what you mean about decoding a UTF-8 string given any 8-bit encoding. Of course the encoding must be known:
>>> u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'.encode('utf-8').decode('latin1').encode('latin1')
'T\xc3\xbcr'
I can assure you that most Germans can differentiate between "Tür" and "TÃ¼r".
Using a BOM with UTF-8 makes it easy to identify it as such AND it shouldn't break any properly written Unicode-aware tools.
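A minimal sketch of the file-signature idea, written for Python 3 rather than the Python 2 session above. The names `has_utf8_signature` and `decode_text` and the latin-1 fallback are hypothetical, chosen only for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (bytes EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_text(data: bytes, fallback: str = "latin-1") -> str:
    """Decode bytes, using the BOM as a signature when it is present.

    The fallback encoding is an assumption for this sketch; a real tool
    would take it from configuration or metadata.
    """
    if has_utf8_signature(data):
        # 'utf-8-sig' decodes UTF-8 and strips the leading BOM.
        return data.decode("utf-8-sig")
    try:
        # No signature: try strict UTF-8 first.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the assumed 8-bit encoding.
        return data.decode(fallback)


door = "T\u00fcr"  # "Tür", spelled with an escape to keep the source ASCII-only
with_bom = codecs.BOM_UTF8 + door.encode("utf-8")
without_bom = door.encode("utf-8")
latin1_bytes = door.encode("latin-1")
```

With the signature present the decoder does not have to guess. Without it, the sketch relies on strict UTF-8 decoding rejecting the latin-1 bytes, which happens to work for this word but is not a guarantee for arbitrary input.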
Cheers, Brian