Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-10 Thread Chris Angelico
On Tue, Feb 10, 2015 at 5:52 AM, Skip Montanaro wrote:
>
> This snapshot was taken against a running LibreOffice instance here at work
> (on Linux). It would appear the fancy schmancy apostrophe was hosed up before
> the data ever got to me. Had a guy here with Windows pop up the original file

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-10 Thread Skip Montanaro
On Mon, Feb 9, 2015 at 11:54 AM, Matthew Ruffalo wrote:
> I think it's most likely that the encoding issues happened in the export
> from XLSX to CSV (unless the data is malformed in the original XLSX
> file, of course).

Aha! Lookee here... (my apologies to all you HTML mail haters - sometimes it

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-09 Thread Skip Montanaro
On Mon, Feb 9, 2015 at 2:38 PM, Skip Montanaro wrote:
> On Mon, Feb 9, 2015 at 2:05 PM, Zachary Ware wrote:
> > If all else fails, you can try ftfy to fix things:
> > http://ftfy.readthedocs.org/en/latest/
>
> Thanks for the pointer. I would prefer to not hand-mangle this stuff
> in case I get

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-09 Thread Zachary Ware
On Mon, Feb 9, 2015 at 11:32 AM, Skip Montanaro wrote:
> LibreOffice spit out a CSV file
> (with those three odd bytes). My script sucked in the CSV file and
> inserted data into my SQLite db.

If all else fails, you can try ftfy to fix things:
http://ftfy.readthedocs.org/en/latest/

>>> import
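For readers without ftfy installed, the specific damage discussed in this thread (UTF-8 bytes misread as CP-1252) can often be reversed with the standard library alone. The sketch below is not ftfy's algorithm; `undo_cp1252_mojibake` is a hypothetical helper name, and it assumes exactly one round of mis-decoding across the whole string.

```python
def undo_cp1252_mojibake(text):
    """Hypothetical helper: reverse one round of UTF-8-read-as-CP-1252.

    Re-encode the string as CP-1252 to recover the original bytes, then
    decode those bytes as UTF-8. If either step fails, the text was
    probably not damaged this way, so it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The damaged apostrophe, written with escapes so the example survives
# any further re-encoding: U+00E2, U+20AC, U+2122.
damaged = "St. Patrick\u00e2\u20ac\u2122s Day Swim Meet"
repaired = undo_cp1252_mojibake(damaged)
print(repaired)
```

This is an all-or-nothing repair: a string that mixes damaged text with characters CP-1252 cannot encode is left untouched, which is one reason a dedicated tool may be the safer choice for messy data.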

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-09 Thread mm0fmf
On 09/02/2015 03:44, Skip Montanaro wrote:
I am trying to process a CSV file using Python 3.5 (CPython tip as of a week or so ago). According to chardet[1], the file is encoded as utf-8:

>>> s = open("data/meets-usms.csv", "rb").read()
>>> len(s)
562272
>>> import chardet
>>> chardet.detect(

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-09 Thread Matthew Ruffalo
On 02/09/2015 12:30 PM, Skip Montanaro wrote:
> Thanks, Chris. Are you telling me I should have defined the input file
> encoding for my CSV file as CP-1252, or that something got hosed on
> the export from XLSX to CSV? Or something else?
>
> Skip

Hi Skip-

I think it's most likely that the encodi

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-09 Thread Chris Angelico
On Tue, Feb 10, 2015 at 4:32 AM, Skip Montanaro wrote:
> On Sun, Feb 8, 2015 at 10:51 PM, Steven D'Aprano wrote:
>> The second question is, are you
>> using Windows?
>
> No, I'm on a Mac (as, I think I indicated in my original note). All
> transformations occurred on a Mac. LibreOffice spit out

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-09 Thread Chris Angelico
On Tue, Feb 10, 2015 at 4:30 AM, Skip Montanaro wrote:
> On Sun, Feb 8, 2015 at 9:58 PM, Chris Angelico wrote:
>> Those three characters are the CP-1252 decode of the bytes for U+2019
>> in UTF-8 (E2 80 99). Not sure if that helps any, but given that it was
>> an XLSX file, Windows codepages are

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-09 Thread Skip Montanaro
On Sun, Feb 8, 2015 at 10:51 PM, Steven D'Aprano wrote:
> The second question is, are you
> using Windows?

No, I'm on a Mac (as, I think I indicated in my original note). All transformations occurred on a Mac. LibreOffice spit out a CSV file (with those three odd bytes). My script sucked in the C
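One way to check whether the Python-to-SQLite leg of that pipeline is at fault is to push a known-good string through it. The sketch below is an assumption-laden stand-in for the script described above, not the original: it uses an in-memory database, a one-column CSV built from a literal, and borrows the `swimmeet` table and `meetname` column names from the query quoted elsewhere in this thread.

```python
import csv
import io
import sqlite3

# Stand-in for the exported CSV: one header row and one data row,
# stored as UTF-8 bytes with a real U+2019 apostrophe.
name = "Anderson Barracuda Masters 2011 St. Patrick\u2019s Day Swim Meet"
raw = ("meetname\n" + name + "\n").encode("utf-8")

# Decode explicitly as UTF-8; newline="" is what the csv module expects.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
reader = csv.DictReader(text_stream)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swimmeet (meetname TEXT)")
conn.executemany(
    "INSERT INTO swimmeet (meetname) VALUES (?)",
    ((row["meetname"],) for row in reader),
)
conn.commit()

rows = conn.execute(
    "SELECT meetname FROM swimmeet WHERE meetname LIKE '%Barracuda%Patrick%'"
).fetchall()
print(rows)
conn.close()
```

If the string read back matches the one written, the CSV reader and the sqlite3 module are passing text through faithfully, which would point at the file contents rather than the script.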

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-09 Thread Skip Montanaro
On Sun, Feb 8, 2015 at 9:58 PM, Chris Angelico wrote:
> Those three characters are the CP-1252 decode of the bytes for U+2019
> in UTF-8 (E2 80 99). Not sure if that helps any, but given that it was
> an XLSX file, Windows codepages are reasonably likely to show up.

Thanks, Chris. Are you telling

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-08 Thread Steven D'Aprano
Skip Montanaro wrote:
> sqlite> select meetname from swimmeet where meetname like
> '%Barracuda%Patrick%';
> Anderson Barracudas St. Patrick's Day Swim Meet
> Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet
> Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet
> Anderson

Re: Python 3.x stuffing utf-8 into SQLite db

2015-02-08 Thread Chris Angelico
On Mon, Feb 9, 2015 at 2:44 PM, Skip Montanaro wrote:
> Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet

Those three characters are the CP-1252 decode of the bytes for U+2019 in UTF-8 (E2 80 99). Not sure if that helps any, but given that it was an XLSX file, Windows codepages are
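The diagnosis above can be reproduced in a few lines of standard-library Python. This is only an illustration of the byte-level mechanics; the variable names are made up for the example, and `bytes.hex(" ")` with a separator needs Python 3.8 or later.

```python
# U+2019 is RIGHT SINGLE QUOTATION MARK, the "curly" apostrophe.
apostrophe = "\u2019"

# Encoded as UTF-8 it occupies three bytes.
utf8_bytes = apostrophe.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 99

# Misreading those three bytes as CP-1252 gives three separate characters:
# 0xE2 -> U+00E2, 0x80 -> U+20AC, 0x99 -> U+2122.
mojibake = utf8_bytes.decode("cp1252")
print([f"U+{ord(c):04X}" for c in mojibake])  # ['U+00E2', 'U+20AC', 'U+2122']
```

Those three code points are exactly the characters that appear in place of the apostrophe in the quoted meet name.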

Python 3.x stuffing utf-8 into SQLite db

2015-02-08 Thread Skip Montanaro
I am trying to process a CSV file using Python 3.5 (CPython tip as of a week or so ago). According to chardet[1], the file is encoded as utf-8:

>>> s = open("data/meets-usms.csv", "rb").read()
>>> len(s)
562272
>>> import chardet
>>> chardet.detect(s)
{'encoding': 'utf-8', 'confidence': 0.99}

so