Hi,

For the first time in my programming life, I have to deal with character encoding. I have a question about BOM marks.
If I understand correctly, in the UTF-8 binary representation of Unicode text, some systems (Windows?) add a BOM mark at the beginning of the file and some (Linux?) don't. Therefore, the exact same text encoded in the same UTF-8 will result in two different binary files of slightly different lengths. Right?

I guess that these leading BOM marks are special marker bytes that can never be decoded as valid text. Right? (I really, really hope the answer is yes, otherwise we're in hell when moving files from one platform to another, even with the same Unicode encoding.)

I also guess that this leading BOM mark is silently ignored by any Unicode-aware file stream reader that has already been told the file follows the UTF-8 encoding standard. Right? If so, is that the case with the Python codecs decoder?

In the Python documentation, I see these constants. The documentation is not clear about which encoding each constant applies to. Here's my understanding:

BOM : UTF-8 only or UTF-8 and UTF-32 ?
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_UTF8 : UTF-8 only
BOM_UTF16 : UTF-16 only
BOM_UTF16_BE : UTF-16 only
BOM_UTF16_LE : UTF-16 only
BOM_UTF32 : UTF-32 only
BOM_UTF32_BE : UTF-32 only
BOM_UTF32_LE : UTF-32 only

Why would I need these constants if the codecs decoder can handle the BOM without my help, given only the encoding?

Thank you

Francis Girard

Python tells me to use an encoding declaration at the top of my files (the message is referring to http://www.python.org/peps/pep-0263.html). I expected to see there a list of acceptable
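
For anyone who wants to check the BOM guesses above on their own machine, here is a minimal sketch. It assumes Python 3 and the standard codecs module (including the "utf-8-sig" codec, which may not exist in older interpreters). The helper name strip_utf8_bom is hypothetical and only there for illustration; this is one way to probe the behaviour, not an authoritative statement of how every reader treats a BOM.

```python
import codecs

# Assumes Python 3. strip_utf8_bom is a hypothetical helper, not a stdlib function.
def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM from raw bytes, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

text = "hello"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

# Same text, two different byte sequences: the BOM adds three bytes.
length_difference = len(with_bom) - len(without_bom)

# The plain "utf-8" codec does not drop the BOM; it decodes it as U+FEFF.
decoded_plain = with_bom.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM when there is one,
# and leaves BOM-less input alone.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_sig_no_bom = without_bom.decode("utf-8-sig")

# The unqualified constants compare equal to the UTF-16 ones.
bom_aliases_utf16 = (
    codecs.BOM == codecs.BOM_UTF16
    and codecs.BOM_BE == codecs.BOM_UTF16_BE
    and codecs.BOM_LE == codecs.BOM_UTF16_LE
)
```

Inspecting decoded_plain, decoded_sig and bom_aliases_utf16 in an interpreter shows which of the assumptions in the question hold. The constants are mainly useful for the manual check done in strip_utf8_bom, when working with raw bytes before choosing a codec.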