On Thu, 16 Jan 2014 11:37:29 -0800, Albert-Jan Roskam wrote:

> --------------------------------------------
> On Thu, 1/16/14, Chris Angelico <ros...@gmail.com> wrote:
>
>  Subject: Re: Guessing the encoding from a BOM
>  To:
>  Cc: "python-list@python.org" <python-list@python.org>
>  Date: Thursday, January 16, 2014, 7:06 PM
>
>  On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjou...@gmail.com> wrote:
>  > 2014/1/16 Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:
>  >> def guess_encoding_from_bom(filename, default):
>  >>     with open(filename, 'rb') as f:
>  >>         sig = f.read(4)
>  >>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>  >>         return 'utf_16'
>  >>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>  >>         return 'utf_32'
>  >>     else:
>  >>         return default
>  >
>  > You might want to add the utf8 bom too: '\xEF\xBB\xBF'.
>
>  I'd actually rather not. It would tempt people to pollute UTF-8 files
>  with a BOM, which is not necessary unless you are MS Notepad.
>
>
> ===> Can you elaborate on that? Unless your utf-8 files will only
> contain ascii characters I do not understand why you would not want a
> bom utf-8.

Because the UTF-8 signature -- it's not actually a Byte Order Mark -- is
not really necessary. Unlike UTF-16 and UTF-32, there is no
platform-dependent ambiguity between Big Endian and Little Endian
systems, so the UTF-8 stream of bytes is identical no matter what
platform you are on.

If the UTF-8 signature were just unnecessary, it wouldn't be too bad, but
it's actually harmful. Pure-ASCII text encoded as UTF-8 is still pure
ASCII, and so backwards compatible with old software that assumes ASCII.
But the same pure-ASCII text encoded as UTF-8 with a signature starts
with three non-ASCII bytes (EF BB BF), so to that software it looks like
a binary file.


> Btw, isn't "read_encoding_from_bom" a better function name than
> "guess_encoding_from_bom"? I thought the point of BOMs was that there
> would be no more need to guess?

Of course it's a guess. If you see a file that starts with 0000FEFF, is
that a UTF-32 text file, or a binary file that happens to start with two
nulls followed by FEFF?


-- 
Steven
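
For anyone who wants to see both points in the interpreter, here is a rough sketch. It is not the function from upthread: guess_encoding_from_bytes is a hypothetical variant that takes the leading bytes directly instead of a filename, and it tests the UTF-32 marks before the UTF-16 ones because the UTF-32 LE mark (FF FE 00 00) itself begins with the UTF-16 LE mark (FF FE). The sample strings are made up for illustration.

```python
import codecs

text = "plain ASCII text"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_utf8_sig = text.encode("utf-8-sig")

# Without a signature, the UTF-8 bytes are exactly the ASCII bytes.
assert as_utf8 == as_ascii

# With a signature, three non-ASCII bytes (EF BB BF) come first,
# so an ASCII-only consumer can no longer read the data.
assert as_utf8_sig == codecs.BOM_UTF8 + as_ascii
try:
    as_utf8_sig.decode("ascii")
    ascii_compatible = True
except UnicodeDecodeError:
    ascii_compatible = False


def guess_encoding_from_bytes(data, default):
    """Guess an encoding from the first bytes of data (hypothetical helper)."""
    # UTF-32 must be tested first: FF FE 00 00 also starts with FF FE.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf_32"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf_16"
    return default


# Still only a guess: these four bytes could be a UTF-32 BOM or the
# start of some unrelated binary format.
binary_header = b"\x00\x00\xfe\xff" + b"\x01\x02\x03\x04"
guess = guess_encoding_from_bytes(binary_header, "latin-1")
```

Here guess comes back as 'utf_32' even though nothing says the data is text, which is why "guess" seems the honest name.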