Lawrence D'Oliveiro wrote:
> In message <87hbgyosdc....@web.de>, Diez B. Roggisch wrote:
>> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> writes:
>>> In message <87d3rorf2f....@web.de>, Diez B. Roggisch wrote:
>>>> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> writes:
>>>>> What exactly is the point of a BOM in a UTF-8-encoded file?
>>>> It's a marker like the "coding: utf-8" in Python files. It tells the
>>>> software aware of it that the content is UTF-8.
>>> But if the software is aware of it, then why does it need to be told?
>> Let me rephrase: Windows editors such as Notepad recognize the BOM, and
>> then assume (hopefully rightfully so) that the rest of the file is text
>> in UTF-8 encoding.
> But they can only recognize it as a BOM if they assume UTF-8 encoding to
> begin with. Otherwise it could be interpreted as some other coding.
Not so. The first three bytes are the flag. For example, in a .dbf
file, the first byte determines what type of dbf the file is: \x03 =
dBase III, \x83 = dBase III with memos, etc. More checking should
naturally be done to ensure the rest of the fields make sense for the
dbf type specified.
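
As a rough sketch of that kind of first-byte check in Python (the table and the function name dbf_type are made up for illustration, and only the two flag values mentioned above are listed):

```python
# Hypothetical lookup table: only the two type flags named above.
DBF_TYPES = {
    0x03: "dBase III",
    0x83: "dBase III with memos",
}

def dbf_type(header: bytes) -> str:
    """Return a label for the type flag stored in the first byte of a dbf header."""
    if not header:
        raise ValueError("empty header")
    flag = header[0]  # indexing bytes yields an int in Python 3
    # Unknown flags are reported rather than guessed at.
    return DBF_TYPES.get(flag, "unknown (0x%02X)" % flag)
```

This only reads the flag; as noted, real code should go on to check that the remaining header fields make sense for that type.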
MS decided that if the first three bytes are \xEF \xBB \xBF then it's a
UTF-8 file, and if they are not, don't open it with an MS product.
Likewise, MS will add those bytes to any UTF-8 file it saves.
Naturally, this causes problems for non-MS usages, but anybody who's had
to work with both MS and non-MS platforms/products/methodologies knows
that MS does not play well with others.
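
For what it's worth, here is a minimal sketch of coping with this from the Python side, assuming the file contents are already in memory as bytes. The helper names has_utf8_bom and decode_utf8 are hypothetical; codecs.BOM_UTF8 and the "utf-8-sig" codec come from the standard library:

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if data starts with the three-byte UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 text, dropping a leading BOM if one is present."""
    # "utf-8-sig" skips the signature when it is there and otherwise
    # behaves like plain "utf-8".
    return data.decode("utf-8-sig")
```

The same codec works in the other direction: encoding with "utf-8-sig" writes the three bytes at the front, which is handy when the file is headed back to a program that expects them.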
~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list