On 2021-02-09, Skip Montanaro <skip.montan...@gmail.com> wrote:
> I downloaded US hospital ICU capacity data this morning from this page:
>
> https://healthdata.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-facility
>
> (The download link is about halfway down the page.)
>
> Trying to read it using my personal CSV tools without specifying an
> encoding, it failed to understand the first column, hospital_pk. That is
> apparently because the file isn't simply ASCII or UTF-8. There are a few
> bytes ahead of the "h". However, if I open the file using "utf-16" as the
> encoding, Python complains there is no BOM. od(1) suggests there is
> *something* ahead of the first column name, but it's three bytes, not two:
>
> % od -A x -t x1z -v < reported_hospital_capacity_admissions_facility_level_weekly_average_timeseries_20210207.csv | head
> 000000 *ef bb bf* 68 6f 73 70 69 74 61 6c 5f 70 6b 2c 63  >...hospital_pk,c<
> 000010 6f 6c 6c 65 63 74 69 6f 6e 5f 77 65 65 6b 2c 73  >ollection_week,s<
> 000020 74 61 74 65 2c 63 63 6e 2c 68 6f 73 70 69 74 61  >tate,ccn,hospita<
> ...

It's UTF-8 with a BOM prepended: ef bb bf is the byte order mark
(U+FEFF) encoded as UTF-8. That's not uncommon when you have a file
that's been converted to UTF-8 from UTF-16 or has been produced by
shitty Microsoft software.

You can tell instantly at a glance that it's not UTF-16, because the
ASCII dump would l.o.o.k. .l.i.k.e. .t.h.i.s.

You can decode it as utf-8 and ignore the BOM character, or, as someone
else has rightly said, Python can decode it as utf-8-sig, which does
that automatically for you.
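
For what it's worth, here is a minimal sketch of both options. The bytes are an illustrative stand-in for the start of the file (a UTF-8 BOM, a shortened header, and one made-up row), not real data from the download, and the variable names are mine.

```python
import codecs
import csv
import io

# Illustrative bytes only: UTF-8 BOM (ef bb bf), a shortened header line,
# and one made-up data row. This is not real data from the file.
raw = codecs.BOM_UTF8 + b"hospital_pk,collection_week,state\nX1,2021-01-29,ZZ\n"

# Plain utf-8 decodes fine but leaves U+FEFF stuck to the first column name.
plain_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# utf-8-sig drops a leading BOM if there is one, and is harmless if not.
sig_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

# Manual alternative: decode as utf-8 and skip the BOM character yourself.
text = raw.decode("utf-8")
if text.startswith("\ufeff"):
    text = text[1:]
manual_header = next(csv.reader(io.StringIO(text)))

# For comparison, real UTF-16 puts a zero byte next to every ASCII
# character, which is what od would show as dots between the letters.
utf16_bytes = "hospital_pk".encode("utf-16-le")

print(repr(plain_header[0]))
print(repr(sig_header[0]))
print(utf16_bytes)
```

When reading the actual file, the same codec name can be passed straight to open(), e.g. open(path, newline="", encoding="utf-8-sig"), where path is whatever you saved the CSV as.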