Re: Guessing the encoding from a BOM

2014-01-18 Thread Chris Angelico
On Sat, Jan 18, 2014 at 8:41 PM, Gregory Ewing wrote:
> Chris Angelico wrote:
>> On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence wrote:
>>> Every time I see it I picture Inspector Clouseau, "A BOM!!!" :)
>>
>> Special delivery, a berm! Were you expecting one?
>
> A berm? Is th

Re: Guessing the encoding from a BOM

2014-01-18 Thread Gregory Ewing
Chris Angelico wrote:
> On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence wrote:
>> Every time I see it I picture Inspector Clouseau, "A BOM!!!" :)
>
> Special delivery, a berm! Were you expecting one?

A berm? Is that anything like a shrubbery?

--
Greg
--
https://mail.python.org/mailman/listinfo/pyth

Re: Guessing the encoding from a BOM

2014-01-17 Thread Rotwang
On 17/01/2014 18:43, Tim Chase wrote:
> On 2014-01-17 09:10, Mark Lawrence wrote:
>> Slight aside, any chance of changing the subject of this thread, or even ending the thread completely? Why? Every time I see it I picture Inspector Clouseau, "A BOM!!!" :)
>
> In discussions regarding BOMs, I regular

Re: Guessing the encoding from a BOM

2014-01-17 Thread Tim Chase
On 2014-01-17 09:10, Mark Lawrence wrote:
> Slight aside, any chance of changing the subject of this thread, or
> even ending the thread completely? Why? Every time I see it I
> picture Inspector Clouseau, "A BOM!!!" :)

In discussions regarding BOMs, I regularly get the "All your base" meme from

Re: Guessing the encoding from a BOM

2014-01-17 Thread Ethan Furman
On 01/17/2014 08:46 AM, Pete Forman wrote:
> Chris Angelico writes:
>> On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence wrote:
>>> Slight aside, any chance of changing the subject of this thread, or even ending the thread completely? Why? Every time I see it I picture Inspector Clouseau, "A BOM!!!" :)

Re: Guessing the encoding from a BOM

2014-01-17 Thread Chris Angelico
On Sat, Jan 18, 2014 at 3:30 AM, Rustom Mody wrote:
> If you or I break a standard then, well, we broke a standard.
> If Microsoft breaks a standard the standard is obliged to change.
>
> Or as the saying goes, everyone is equal though some are more equal.

https://en.wikipedia.org/wiki/800_pound_

Re: Guessing the encoding from a BOM

2014-01-17 Thread Pete Forman
Chris Angelico writes:
> On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence wrote:
>> Slight aside, any chance of changing the subject of this thread, or even
>> ending the thread completely? Why? Every time I see it I picture Inspector
>> Clouseau, "A BOM!!!" :)
>
> Special delivery, a berm! We

Re: Guessing the encoding from a BOM

2014-01-17 Thread Chris Angelico
On Sat, Jan 18, 2014 at 3:26 AM, Pete Forman wrote:
> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

Or call that one UTF-8, and the one with the marker can be UTF-8-MS-NOTEPAD.

ChrisA

Re: Guessing the encoding from a BOM

2014-01-17 Thread Rustom Mody
On Friday, January 17, 2014 9:56:28 PM UTC+5:30, Pete Forman wrote:
> Rustom Mody writes:
>> On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
>>> On 2014-01-17 11:14, Chris Angelico wrote:
>>>> UTF-8 specifies the byte order
>>>> as part of the protocol, so you don't need t

Re: Guessing the encoding from a BOM

2014-01-17 Thread Pete Forman
Rustom Mody writes:
> On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
>> On 2014-01-17 11:14, Chris Angelico wrote:
>>> UTF-8 specifies the byte order
>>> as part of the protocol, so you don't need to mark it.
>>
>> You don't need to mark it when writing, but some idiots use it

Re: Guessing the encoding from a BOM

2014-01-17 Thread Chris Angelico
On Fri, Jan 17, 2014 at 8:47 PM, Mark Lawrence wrote:
> On 17/01/2014 09:43, Chris Angelico wrote:
>> On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence wrote:
>>> Slight aside, any chance of changing the subject of this thread, or even
>>> ending the thread completely? Why? Every time I

Re: Guessing the encoding from a BOM

2014-01-17 Thread Mark Lawrence
On 17/01/2014 09:43, Chris Angelico wrote:
> On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence wrote:
>> Slight aside, any chance of changing the subject of this thread, or even ending the thread completely? Why? Every time I see it I picture Inspector Clouseau, "A BOM!!!" :)
>
> Special delivery, a be

Re: Guessing the encoding from a BOM

2014-01-17 Thread Chris Angelico
On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence wrote:
> Slight aside, any chance of changing the subject of this thread, or even
> ending the thread completely? Why? Every time I see it I picture Inspector
> Clouseau, "A BOM!!!" :)

Special delivery, a berm! Were you expecting one?

ChrisA

Re: Guessing the encoding from a BOM

2014-01-17 Thread Mark Lawrence
On 17/01/2014 01:40, Tim Chase wrote:
> On 2014-01-17 11:14, Chris Angelico wrote:
>> UTF-8 specifies the byte order as part of the protocol, so you don't need to mark it.
>
> You don't need to mark it when writing, but some idiots use it anyway. If you're sniffing a file for purposes of reading, you

Re: Guessing the encoding from a BOM

2014-01-16 Thread Rustom Mody
On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> On 2014-01-17 11:14, Chris Angelico wrote:
>> UTF-8 specifies the byte order
>> as part of the protocol, so you don't need to mark it.
>
> You don't need to mark it when writing, but some idiots use it
> anyway. If you're sniffin

Re: Guessing the encoding from a BOM

2014-01-16 Thread Tim Chase
On 2014-01-17 11:14, Chris Angelico wrote:
> UTF-8 specifies the byte order
> as part of the protocol, so you don't need to mark it.

You don't need to mark it when writing, but some idiots use it anyway. If you're sniffing a file for purposes of reading, you need to look for it and remove it from
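
A minimal sketch of the look-for-it-and-remove-it step described above, assuming Python 3 and its standard `utf-8-sig` codec (the codec is not named in the snippet, and the sample bytes are made up for illustration):

```python
import codecs

# Hypothetical sample: a UTF-8 payload that starts with a BOM (EF BB BF).
data = codecs.BOM_UTF8 + 'hello'.encode('utf-8')

# Plain 'utf-8' keeps the BOM in the decoded text as U+FEFF.
plain = data.decode('utf-8')

# 'utf-8-sig' drops one leading BOM if it is present...
stripped = data.decode('utf-8-sig')

# ...and behaves like plain 'utf-8' when there is none.
no_bom = b'hello'.decode('utf-8-sig')

print(repr(plain), repr(stripped), repr(no_bom))
```

The same codec name can be passed as `encoding='utf-8-sig'` to `open()` when reading, which is probably the least intrusive way to tolerate BOM-prefixed UTF-8 without writing one yourself.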

Re: Guessing the encoding from a BOM

2014-01-16 Thread Steven D'Aprano
On Thu, 16 Jan 2014 11:37:29 -0800, Albert-Jan Roskam wrote:
> On Thu, 1/16/14, Chris Angelico wrote:
>
> Subject: Re: Guessing the encoding from a BOM
> To:
> Cc: "python-list@python.org"
> Date: Thursday, January 16,

Re: Guessing the encoding from a BOM

2014-01-16 Thread Chris Angelico
On Fri, Jan 17, 2014 at 6:37 AM, Albert-Jan Roskam wrote:
> Can you elaborate on that? Unless your utf-8 files will only contain ascii
> characters I do not understand why you would not want a bom utf-8.

It's completely unnecessary, and could cause problems (the BOM is actually whitespace, albei

Re: Guessing the encoding from a BOM

2014-01-16 Thread Albert-Jan Roskam
On Thu, 1/16/14, Chris Angelico wrote:

Subject: Re: Guessing the encoding from a BOM
To:
Cc: "python-list@python.org"
Date: Thursday, January 16, 2014, 7:06 PM

On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist wrote:
> 201

Re: Guessing the encoding from a BOM

2014-01-16 Thread Tim Chase
On 2014-01-17 05:06, Chris Angelico wrote:
>> You might want to add the utf8 bom too: '\xEF\xBB\xBF'.
>
> I'd actually rather not. It would tempt people to pollute UTF-8
> files with a BOM, which is not necessary unless you are MS Notepad.

If the intent is to just sniff and parse the file acco

Re: Guessing the encoding from a BOM

2014-01-16 Thread Chris Angelico
On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist wrote:
> 2014/1/16 Steven D'Aprano:
>> def guess_encoding_from_bom(filename, default):
>>     with open(filename, 'rb') as f:
>>         sig = f.read(4)
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif si

Re: Guessing the encoding from a BOM

2014-01-16 Thread Björn Lindqvist
2014/1/16 Steven D'Aprano:
> def guess_encoding_from_bom(filename, default):
>     with open(filename, 'rb') as f:
>         sig = f.read(4)
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>

Re: Guessing the encoding from a BOM

2014-01-15 Thread Ethan Furman
On 01/15/2014 10:55 PM, Steven D'Aprano wrote:
> On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:
>> +1. I'd like a custom exception class, sub-classed from ValueError.
>
> Why ValueError? It's not really a "invalid value" error, it's more "my heuristic isn't good enough" failure. (Maybe the file

Re: Guessing the encoding from a BOM

2014-01-15 Thread Steven D'Aprano
On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:
> Steven D'Aprano writes:
>
>> enc = guess_encoding_from_bom("filename")
>> if enc == something:
>>     # Can't guess, fall back on an alternative strategy
>>     ...
>> else:
>>     f = open("filename", encoding=enc)
>>
>> If I forget to check th

Re: Guessing the encoding from a BOM

2014-01-15 Thread Steven D'Aprano
On Thu, 16 Jan 2014 16:01:56 +1100, Chris Angelico wrote:
> On Thu, Jan 16, 2014 at 1:13 PM, Steven D'Aprano wrote:
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'

Re: Guessing the encoding from a BOM

2014-01-15 Thread Ethan Furman
On 01/15/2014 07:47 PM, Ben Finney wrote:
> Steven D'Aprano writes:
>> (4) Don't return anything, but raise an exception. (But which exception?)
>
> +1. I'd like a custom exception class, sub-classed from ValueError.

+1

--
~Ethan~
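
A rough sketch of the "custom exception class, sub-classed from ValueError" idea, for illustration only. The names `EncodingGuessError` and `encoding_from_signature` are hypothetical and do not come from the thread, and the function takes the already-read leading bytes rather than a filename:

```python
import codecs


class EncodingGuessError(ValueError):
    """Raised when no known BOM is found (hypothetical name)."""


# Longer signatures come first, so the UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf_32'),
    (codecs.BOM_UTF32_LE, 'utf_32'),
    (codecs.BOM_UTF16_BE, 'utf_16'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
)


def encoding_from_signature(sig):
    """Return the encoding implied by the leading bytes, or raise."""
    for bom, name in _BOMS:
        if sig.startswith(bom):
            return name
    raise EncodingGuessError('no recognised BOM in %r' % (sig,))
```

With this shape, a caller who forgets to handle the no-BOM case gets a traceback instead of a silent sentinel, while existing `except ValueError` handlers still catch it.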

Re: Guessing the encoding from a BOM

2014-01-15 Thread Chris Angelico
On Thu, Jan 16, 2014 at 1:13 PM, Steven D'Aprano wrote:
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'

I'd swap the order of these two checks. If the file starts FF FE
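
One way the swapped order might look, as a sketch rather than the poster's actual code. It assumes the function receives the first four bytes directly (the original takes a filename) and uses the BOM constants from the standard `codecs` module:

```python
import codecs


def guess_encoding_from_bom(sig, default=None):
    """Guess an encoding from the first four bytes of a file."""
    # Test UTF-32 first: its little-endian BOM (FF FE 00 00) begins
    # with the UTF-16 little-endian BOM (FF FE), so checking UTF-16
    # first would shadow it.
    if sig.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return 'utf_32'
    elif sig.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf_16'
    return default
```

A UTF-16 LE file whose first character is U+0000 would still be read as UTF-32 under this ordering; that case is presumably rare enough for a heuristic to accept.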

Re: Guessing the encoding from a BOM

2014-01-15 Thread Ben Finney
Steven D'Aprano writes:
> enc = guess_encoding_from_bom("filename")
> if enc == something:
>     # Can't guess, fall back on an alternative strategy
>     ...
> else:
>     f = open("filename", encoding=enc)
>
> If I forget to check the returned result, I should get an explicit
> failure as

Guessing the encoding from a BOM

2014-01-15 Thread Steven D'Aprano
I have a function which guesses the likely encoding used by text files by reading the BOM (byte order mark) at the beginning of the file. A simplified version:

def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith((b'\x