Lawrence D'Oliveiro wrote:
In message <87hbgyosdc....@web.de>, Diez B. Roggisch wrote:

Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> writes:

In message <87d3rorf2f....@web.de>, Diez B. Roggisch wrote:

Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> writes:

What exactly is the point of a BOM in a UTF-8-encoded file?
It's a marker, like the "coding: utf-8" line in Python files. It tells
software that is aware of it that the content is UTF-8.
But if the software is aware of it, then why does it need to be told?
Let me rephrase: Windows editors such as Notepad recognize the BOM, and
then assume (hopefully rightly so) that the rest of the file is text
in UTF-8 encoding.

But they can only recognize it as a BOM if they assume UTF-8 encoding to begin with. Otherwise it could be interpreted as some other encoding.

Not so. The first three bytes are the flag. For example, in a .dbf file, the first byte determines what type of dbf the file is: \x03 = dBase III, \x83 = dBase III with memos, etc. More checking should naturally be done to ensure the rest of the fields make sense for the dbf type specified.
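To make the dbf comparison concrete, here is a minimal sketch of that kind of first-byte check. The two values are the ones mentioned above; the names DBF_TYPES and sniff_dbf_type are made up for illustration, and a real reader would go on to validate the rest of the header.

```python
# Type bytes as given in this message; other values exist but are not listed here.
DBF_TYPES = {
    0x03: "dBase III",
    0x83: "dBase III with memos",
}


def sniff_dbf_type(header: bytes) -> str:
    """Return a label for the dbf type flagged by the first header byte.

    Hypothetical helper: it only looks at byte 0 and does none of the
    further field checking a real dbf reader should do.
    """
    if not header:
        return "empty"
    return DBF_TYPES.get(header[0], "unknown")
```

Calling sniff_dbf_type on the first few bytes of a file gives a first guess only; the remaining header fields still have to make sense for that type.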

MS decided that if the first three bytes are \xEF \xBB \xBF, then it's a UTF-8 file; if they are not, don't open the file with an MS product. Likewise, MS products will add those bytes to any UTF-8 file they save.
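In Python terms, the check amounts to roughly the following. This is a sketch, not what any MS product actually runs; split_utf8_bom is a made-up name, while codecs.BOM_UTF8 and the "utf-8-sig" codec are in the standard library.

```python
import codecs


def split_utf8_bom(data: bytes):
    """Return (had_bom, text), decoding data as UTF-8.

    Hypothetical helper for illustration only.
    """
    # codecs.BOM_UTF8 is the three-byte flag b'\xef\xbb\xbf'.
    had_bom = data.startswith(codecs.BOM_UTF8)
    # 'utf-8-sig' drops a leading BOM if present and otherwise
    # behaves like plain 'utf-8'.
    return had_bom, data.decode("utf-8-sig")
```

Note that the same three bytes are also valid in single-byte encodings such as cp1252, so treating them as a UTF-8 flag is a convention rather than proof.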

Naturally, this causes problems for non-MS software, but anybody who's had to work with both MS and non-MS platforms/products/methodologies knows that MS does not play well with others.

~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list
