On Tue, Feb 10, 2015 at 4:30 AM, Skip Montanaro <skip.montan...@gmail.com> wrote:
> On Sun, Feb 8, 2015 at 9:58 PM, Chris Angelico <ros...@gmail.com> wrote:
>> Those three characters are the CP-1252 decode of the bytes for U+2019
>> in UTF-8 (E2 80 99). Not sure if that helps any, but given that it was
>> an XLSX file, Windows codepages are reasonably likely to show up.
>
> Thanks, Chris. Are you telling me I should have defined the input file
> encoding for my CSV file as CP-1252, or that something got hosed on
> the export from XLSX to CSV? Or something else?
>
> Skip

Well, I'm not entirely sure. If your input file is actually CP-1252 and
you try to decode it as UTF-8, you'll almost certainly get an error
(unless of course it's all ASCII, but you know it isn't in this case).
Also, I'd say chardet will be correct.

But it might be worth locating one of those apostrophes in the file and
looking at the actual bytes representing it, because what you may have
is a crazy double-encoded system. If you take a document with U+2019 in
it, encode it as UTF-8, decode those bytes as CP-1252, and then
re-encode the result as UTF-8, you could get those three characters.
(I think. Haven't actually checked.) If someone gave UTF-8 bytes to a
program that doesn't know the difference between bytes and characters
and assumes CP-1252, then you might well get something like this.

Hence, having a look at the exact bytes in the .CSV file may help.
Easiest might be to pull it up in a hex viewer (I use 'hd' on my Debian
systems) and grep for the critical line. Otherwise, use Python and try
to pull out a line from the byte stream.

Good luck. You may need it.

ChrisA
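
For what it's worth, the encode/decode round trip described above can be
sketched in a few lines of Python 3. This is only an illustration under
some assumptions: the names (original, mojibake, classify_apostrophe,
and so on) are made up for the example, and the byte search is a rough
heuristic, not a real encoding detector. In particular, a lone 0x92 byte
is how CP-1252 spells U+2019, but 0x92 can also turn up as a
continuation byte inside other UTF-8 sequences.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the "curly" apostrophe.
original = "\u2019"

# Step 1: correct UTF-8 encoding, the bytes E2 80 99.
utf8_bytes = original.encode("utf-8")

# Step 2: something wrongly decodes those bytes as CP-1252,
# producing three characters: U+00E2, U+20AC, U+2122.
mojibake = utf8_bytes.decode("cp1252")

# Step 3: the three characters are re-encoded as UTF-8,
# giving the double-encoded bytes C3 A2 E2 82 AC E2 84 A2.
double_encoded = mojibake.encode("utf-8")

# Reversing steps 2 and 3 recovers the original character.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")


def classify_apostrophe(data: bytes) -> str:
    """Guess how U+2019 is stored in raw bytes (hypothetical helper).

    The double-encoded form is checked first. It cannot be confused with
    the plain UTF-8 form, because E2 80 99 never occurs inside it.
    """
    if double_encoded in data:
        return "double-encoded"
    if utf8_bytes in data:
        return "utf-8"
    if b"\x92" in data:  # heuristic only, see the note above
        return "cp1252"
    return "not found"
```

To try it on the real file, read it in binary mode and pass each line
through the helper, e.g. `for line in open(path, "rb"): ...`, where
`path` stands in for wherever the CSV actually lives.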