Russ Allbery <r...@debian.org> writes:

> I don't believe it's correct to expect UTF-8 files to include this.
> I've heard of BOM marks used this way from the very early days of Unicode,
> but so far as I understand it, the world has largely given up on this
> approach and UTF-8 generators do not produce them.

I did a bit more research, and apparently this approach has become more
blessed again. I'm glad I looked it up! As of Unicode 5.0, the standard
explicitly recommended against doing this, but the latest version of the
standard is moderately positive about it (although it doesn't require it):

    In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>.
    Although there are never any questions of byte order with UTF-8
    text, this sequence can serve as signature for UTF-8 encoded text
    where the character set is unmarked.

(although it does strongly discourage it if there's any other signaling
method available).

I'm still a bit dubious about this, since I don't believe editors and
generators normally add it, but given how we generate the text versions of
the documents, it's relatively easy to add a leading BOM and seems
harmless. I'll take a look.

-- 
Russ Allbery (r...@debian.org)               <http://www.eyrie.org/~eagle/>
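
For illustration only, here is a minimal Python sketch of what "adding a leading BOM" to generated UTF-8 text could look like. It is not how the document build actually works; the function name add_utf8_bom is hypothetical, and the only thing taken from the discussion above is the three-byte sequence EF BB BF.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Return data with a leading UTF-8 BOM (EF BB BF), added at most once.

    Hypothetical helper for illustration; not part of any existing tool.
    """
    # codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


if __name__ == "__main__":
    text = "Debian Policy Manual\n".encode("utf-8")
    marked = add_utf8_bom(text)
    # The "utf-8-sig" codec strips a leading BOM when decoding, so readers
    # that use it see the original text.
    print(marked[:3].hex(), repr(marked.decode("utf-8-sig")))
```

A reader that decodes with plain "utf-8" instead will see the BOM as a leading U+FEFF character, which is one reason to add it only where the encoding is otherwise unmarked.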