Re: Guessing the encoding from a BOM

Pete Forman Fri, 17 Jan 2014 08:32:15 -0800

Rustom Mody <[email protected]> writes:

> On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
>> On 2014-01-17 11:14, Chris Angelico wrote:
>> > UTF-8 specifies the byte order
>> > as part of the protocol, so you don't need to mark it.
>
>> You don't need to mark it when writing, but some idiots use it
>> anyway.  If you're sniffing a file for purposes of reading, you need
>> to look for it and remove it from the actual data that gets returned
>> from the file--otherwise, your data can see it as corruption.  I end
>> up with lots of CSV files from customers who have polluted it with
>> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
>> means my first column-name gets the BOM prefixed onto it when the
>> file is passed to csv.DictReader, grr.
>
> And its part of the standard:
> Table 2.4 here
> http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf


It would have been nice if there was an eighth encoding scheme defined
there UTF-8NB which would be UTF-8 with BOM not allowed.
-- 
Pete Forman
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Guessing the encoding from a BOM

Reply via email to