On Friday, January 17, 2014 9:56:28 PM UTC+5:30, Pete Forman wrote:
> Rustom Mody writes:
> > On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> >> On 2014-01-17 11:14, Chris Angelico wrote:
> >> > UTF-8 specifies the byte order
> >> > as part of the protocol, so you don't need to mark it.
> >> You don't need to mark it when writing, but some idiots use it
> >> anyway. If you're sniffing a file for purposes of reading, you need
> >> to look for it and remove it from the actual data that gets returned
> >> from the file--otherwise, your data can see it as corruption. I end
> >> up with lots of CSV files from customers who have polluted it with
> >> Notepad or had Excel insert some UTF-8 BOM when exporting. This
> >> means my first column-name gets the BOM prefixed onto it when the
> >> file is passed to csv.DictReader, grr.
> > And it's part of the standard:
> > Table 2.4 here
> > http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

If you or I break a standard then, well, we broke a standard.
If Microsoft breaks a standard, the standard is obliged to change.

Or as the saying goes, everyone is equal though some are more equal.
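
For the csv.DictReader annoyance Tim describes, here is a minimal sketch, assuming Python 3, of how the "utf-8-sig" codec can be used to drop a leading BOM when reading. The sample data and the read_rows helper are made up for illustration; they are not from anyone's actual code in this thread.

```python
import codecs
import csv
import io

# Hypothetical sample: a CSV export that starts with a UTF-8 BOM,
# as Notepad or Excel might produce.
raw = codecs.BOM_UTF8 + "name,city\nAda,London\n".encode("utf-8")


def read_rows(data, encoding):
    """Decode bytes with the given encoding and parse them as CSV rows."""
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.DictReader(text))


# Plain "utf-8" keeps the BOM as U+FEFF, glued onto the first column name.
polluted = read_rows(raw, "utf-8")

# "utf-8-sig" drops a leading BOM if present and is harmless if it is absent.
clean = read_rows(raw, "utf-8-sig")

print(list(polluted[0]))  # column names as seen with plain utf-8
print(list(clean[0]))     # column names as seen with utf-8-sig
```

With a real file the same idea would be open(path, encoding="utf-8-sig", newline="") passed straight to csv.DictReader. It only handles the UTF-8 BOM; it does not sniff for UTF-16 or anything else.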