If I understand correctly, when text is written in UTF-8, some systems (Windows?) add a BOM mark at the beginning of the file and some (Linux?) don't. Therefore, the exact same text encoded in the same UTF-8 will result in two different binary files of slightly different lengths. Right?
Mostly correct. I would prefer that people refer to it not as a "BOM" but as the "UTF-8 signature", at least in the context of UTF-8, since UTF-8 has no byte-order issues for a "byte order mark" to deal with. (It is correct to call it a "BOM" in the context of UTF-16 or UTF-32.)
Also, "some systems" is imprecise. It is not so much the operating system that decides to add or leave out the UTF-8 signature, but much more the application writing the file. Any high-quality tool will accept the file with or without the signature, whether it is a tool on Windows or a tool on Unix.
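To make the size difference concrete, here is a minimal sketch; the sample string is made up for illustration. It encodes the same text once without and once with the signature and compares the lengths.

```python
import codecs

text = u"h\u00e9llo"  # arbitrary sample text, identical in both cases

plain = text.encode("utf-8")       # what an application writes without the signature
signed = codecs.BOM_UTF8 + plain   # what an application writes with the signature

# The signature is the three bytes EF BB BF, so the files differ by 3 bytes.
size_difference = len(signed) - len(plain)
```

Apart from those leading bytes, the two byte strings are identical.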
I personally would write my applications so that they put the signature into files that cannot be concatenated meaningfully (since the signature simplifies encoding auto-detection) and leave out the signature from files which can be concatenated (as concatenating the files will put the signature in the middle of a file).
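The concatenation problem can be sketched as follows; the two byte strings stand in for the contents of two hypothetical files that were each written with a signature.

```python
import codecs

# Contents of two files, each written with a leading UTF-8 signature.
part_a = codecs.BOM_UTF8 + u"first\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + u"second\n".encode("utf-8")

# Byte-wise concatenation, as "cat a b > c" would do.
combined = part_a + part_b

# The second signature now sits in the middle of the data and
# decodes to a stray U+FEFF character inside the text.
text = combined.decode("utf-8")
stray_position = text.find(u"\ufeff", 1)
```

Here `stray_position` points at the U+FEFF that ends up between the two lines.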
I guess that this leading BOM mark consists of special marker bytes that cannot, in any way, be decoded as valid text.
Right?
Wrong. The BOM mark decodes as U+FEFF:
>>> import codecs
>>> codecs.BOM_UTF8.decode("utf-8")
u'\ufeff'
This is what makes it a byte order mark: in UTF-16, you can tell the byte order by checking whether it is FEFF or FFFE. The character U+FFFE is an invalid character, which cannot be decoded as valid text (although the Python codec will decode it as invalid text).
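A rough sketch of that check, assuming the data is already known to be UTF-16 (a UTF-32 little-endian file also begins with FF FE, so this is not a general encoding detector). The helper name is made up for illustration.

```python
import codecs

def detect_utf16_byte_order(data):
    """Return 'big', 'little', or None from the first two bytes (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return None                                # no mark present

big = codecs.BOM_UTF16_BE + u"A".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + u"A".encode("utf-16-le")

big_order = detect_utf16_byte_order(big)
little_order = detect_utf16_byte_order(little)
```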
I also guess that this leading BOM mark is silently ignored by any Unicode-aware file stream reader that has already been told that the file follows the UTF-8 encoding standard.
Right?
No. It should eventually be ignored by the application, but whether the stream reader special-cases it or not depends on application needs.
If so, is that the case with the Python codecs decoder?
No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports it to the application when it finds it, and it will never generate the signature on its own. So processing the UTF-8 signature is left to the application in Python.
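As a minimal sketch of what "left to the application" can look like: the helper below is hypothetical, not part of the codecs module, and simply drops one leading signature before decoding.

```python
import codecs

def decode_utf8_tolerant(data):
    """Decode UTF-8 bytes, dropping one leading signature if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

signed = codecs.BOM_UTF8 + u"abc".encode("utf-8")

raw = signed.decode("utf-8")           # the plain codec reports the signature as U+FEFF
clean = decode_utf8_tolerant(signed)   # the application strips it
encoded = u"abc".encode("utf-8")       # the plain codec never writes a signature
```

The same helper accepts input without a signature unchanged, so it handles both kinds of file.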
In the Python documentation, I see these constants. The documentation is not clear about which encoding these constants apply to. Here's my understanding:
BOM : UTF-8 only or UTF-8 and UTF-32 ?
UTF-16.
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
UTF-16.
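These relationships can be checked directly. The sketch below compares the three constants with the explicitly named UTF-16 constants in the codecs module; the variable names are only for illustration.

```python
import codecs
import sys

# BOM_BE and BOM_LE are the UTF-16 marks in big- and little-endian byte order.
be_is_utf16 = codecs.BOM_BE == codecs.BOM_UTF16_BE == b"\xfe\xff"
le_is_utf16 = codecs.BOM_LE == codecs.BOM_UTF16_LE == b"\xff\xfe"

# BOM is the UTF-16 mark in the native byte order of the running machine.
native_mark = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
bom_is_native_utf16 = codecs.BOM == native_mark
```

None of the three is the UTF-8 signature, which has its own constant, `codecs.BOM_UTF8`.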
Why should I need these constants if the codecs decoder can handle them without my help, given only the encoding?
Well, because the codecs don't. It might be useful to add a "utf-8-signature" codec some day, which generates the signature on encoding, and removes it on decoding.
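For readers on later Python releases (2.5 onward): a codec with this behaviour ships under the name "utf-8-sig". The sketch below assumes such a release and shows both directions.

```python
import codecs

# Encoding with "utf-8-sig" writes the signature in front of the data.
signed = u"abc".encode("utf-8-sig")

# Decoding with "utf-8-sig" removes a leading signature if there is one...
from_signed = signed.decode("utf-8-sig")

# ...and also accepts data that has no signature at all.
from_unsigned = u"abc".encode("utf-8").decode("utf-8-sig")
```

On releases without that codec, the signature still has to be handled by the application using `codecs.BOM_UTF8`.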
Regards,
Martin