Try setting encoding to "utf-8-sig". 'ef bb bf' is the byte order mark for UTF-8 (most systems do not include it in UTF-8 encoded files).
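To see the difference for yourself, here is a minimal sketch. The sample bytes, the placeholder row values, and the helper name first_header are made up for illustration and are not taken from the real file; only the leading 'ef bb bf' and the first three column names match the od output in the original question.

```python
import csv
import io

# Hypothetical sample: a UTF-8 BOM (ef bb bf) followed by a header line and
# one row of placeholder values. This is not data from the real download.
raw = b"\xef\xbb\xbfhospital_pk,collection_week,state\nx,y,z\n"


def first_header(data: bytes, encoding: str) -> str:
    """Decode data with the given encoding and return the first CSV column name."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return next(csv.reader(text))[0]


plain = first_header(raw, "utf-8")           # BOM survives as U+FEFF in the name
sig = first_header(raw, "utf-8-sig")         # BOM is stripped on read
no_bom = first_header(raw[3:], "utf-8-sig")  # input without a BOM also works

print(repr(plain), repr(sig), repr(no_bom))
```

Since "utf-8-sig" also decodes input that has no BOM, it should be a reasonable default to pass on the command line when you don't know whether a given UTF-8 file carries one.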
Python will correctly read UTF8 BOMs if you use the 'utf-8-sig' encoding when reading files.

Steve

On Tue, Feb 9, 2021 at 2:56 PM Skip Montanaro <skip.montan...@gmail.com> wrote:

> I downloaded US hospital ICU capacity data this morning from this page:
>
> https://healthdata.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-facility
>
> (The download link is about halfway down the page.)
>
> Trying to read it using my personal CSV tools without specifying an
> encoding, it failed to understand the first column, hospital_pk. That is
> apparently because the file isn't simply ASCII or UTF-8. There are a few
> bytes ahead of the "h". However, if I open the file using "utf-16" as the
> encoding, Python complains there is no BOM. od(1) suggests there is
> *something* ahead of the first column name, but it's three bytes, not two:
>
> % od -A x -t x1z -v < reported_hospital_capacity_admissions_facility_level_weekly_average_timeseries_20210207.csv | head
> 000000 *ef bb bf* 68 6f 73 70 69 74 61 6c 5f 70 6b 2c 63  >...hospital_pk,c<
> 000010 6f 6c 6c 65 63 74 69 6f 6e 5f 77 65 65 6b 2c 73  >ollection_week,s<
> 000020 74 61 74 65 2c 63 63 6e 2c 68 6f 73 70 69 74 61  >tate,ccn,hospita<
> ...
>
> I'm opening the file like so:
>
>     inf = open(args[0], "r", encoding=encoding)
>
> where encoding is passed on the command line. I know I can simply edit out
> those bytes and probably be good-to-go, but I'd prefer not to. What should
> I be passing for the encoding?
>
> Skip, who thought everybody had effectively settled on utf-8 at this point,
> but apparently not...
> --
> https://mail.python.org/mailman/listinfo/python-list