> What are you talking about? The BOM and UTF-16 go hand in hand.
> Without a Byte Order Mark, you can't unambiguously determine whether
> big- or little-endian UTF-16 was used. If, for example, you came across
> a UTF-16 text file containing this hexadecimal data: 2200
>
> what would you assume? That it is a quote character in little-endian
> format or that it is a for-all symbol in big-endian format?
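For the record, the ambiguity in that example is easy to reproduce. A minimal sketch, assuming Python 3 and its standard codecs module (the variable names are just for illustration):

```python
import codecs

# The two bytes from the example above: hex 22 00.
data = bytes([0x22, 0x00])

# Without a BOM the reader has to guess the byte order.
as_le = data.decode("utf-16-le")  # U+0022, the quote character
as_be = data.decode("utf-16-be")  # U+2200, the for-all symbol

# With a BOM in front, the generic "utf-16" codec detects the
# byte order itself and drops the BOM from the result.
with_le_bom = (codecs.BOM_UTF16_LE + data).decode("utf-16")
with_be_bom = (codecs.BOM_UTF16_BE + data).decode("utf-16")

print(repr(as_le), repr(as_be), repr(with_le_bom), repr(with_be_bom))
```

Same two bytes, two different characters, and only the BOM settles which one was meant.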
I'm well aware of the need for a BOM in encodings built on multi-byte
code units, like UTF-16. But I don't see the need for one in a UTF-8
byte sequence. I first encountered it in the output of an MS tool (I
can't remember when or what exactly that was), and I have to confess
that I put it down to stupidity on MS's part. But according to the FAQ
you mentioned, it is apparently legal in UTF-8 too. Nevertheless, the
FAQ states:

"""
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?
If yes, then can I still assume the remaining UTF-8 bytes are in
big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to
the endianness of the byte stream. UTF-8 always has the same byte
order. An initial BOM is only used as a signature - an indication that
an otherwise unmarked text file is in UTF-8. Note that some recipients
of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used
transparently in 8-bit environments, the use of a BOM will interfere
with any protocol or file format that expects specific ASCII characters
at the beginning, such as the use of "#!" at the beginning of Unix
shell scripts. [AF] & [MD]
"""

So they admit that it makes no sense, especially as decoding a UTF-8
byte string with an 8-bit encoding like latin1 will succeed anyway.

So in the end, I stand corrected. But I still think it's crap. Just not
MS crap. :)

--
Regards,

Diez B. Roggisch
--
http://mail.python.org/mailman/listinfo/python-list
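P.S. The points from the FAQ can be checked quickly. This is only a rough sketch, assuming Python 3 and its "utf-8-sig" codec. The sample strings are made up for illustration.

```python
import codecs

script = "#!/bin/sh"

plain = script.encode("utf-8")       # no signature
signed = script.encode("utf-8-sig")  # UTF-8 BOM (EF BB BF) in front

# The signature is just three extra bytes at the start ...
has_bom = signed == codecs.BOM_UTF8 + plain

# ... which is exactly what breaks anything expecting "#!" first.
plain_has_shebang = plain.startswith(b"#!")
signed_has_shebang = signed.startswith(b"#!")

# The plain "utf-8" codec keeps the BOM as U+FEFF in the text;
# "utf-8-sig" strips it when present.
kept = signed.decode("utf-8")
stripped = signed.decode("utf-8-sig")

# Decoding UTF-8 bytes as latin1 never fails, it just yields garbage:
# u-umlaut (C3 BC in UTF-8) comes out as two latin1 characters.
mojibake = "\u00fc".encode("utf-8").decode("latin-1")

print(has_bom, plain_has_shebang, signed_has_shebang)
print(repr(kept[0]), repr(stripped), repr(mojibake))
```

So a successful latin1 decode tells you nothing about the real encoding, which is presumably why the BOM gets used as a signature in the first place.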