[issue28642] csv reader loosing rows with big files and tab delimiter
New submission from Marc Garcia: I'm using the csv module from Python standard library, to read a 1.4Gb file with 11,157,064 of rows. The file is the Geonames dataset for all countries, which can be freely downloaded [1]. I'm using this code to read it: import csv with open('allCountries.txt', 'r') as fd: reader = csv.reader(fd, delimiter='\t') for i, row in enumerate(reader): pass print(i + 1) # prints 10381963 print(reader.line_num) # prints 11157064 For some reason, there are around 7% of the rows in the files, that are skipped. The rows doesn't have anything special (most of them are all ascii characters, even if the file is in utf-8). If I create a new file with all the skipped files, and I read it again in the same way, around 30% of the rows are skipped. So many of them weren't returned by the iterator when being a part of a bigger file, but now they are. Note that the attribute line_num has the right number. Also note that if I remove the delimiter parameter (tab) from the reader, and it uses the default comma, the iteration on the reader doesn't skip any row. I checked what I think it's the relevant part of the code [2], but I couldn't see anything that could cause this bug. 1. http://download.geonames.org/export/dump/allCountries.zip 2. https://hg.python.org/cpython/file/tip/Modules/_csv.c#l787 -- components: Library (Lib) messages: 280323 nosy: datapythonista priority: normal severity: normal status: open title: csv reader loosing rows with big files and tab delimiter versions: Python 3.5 ___ Python tracker <http://bugs.python.org/issue28642> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28642] csv reader losing rows with big files and tab delimiter
Marc Garcia added the comment: Sorry, my fault. It looks like having quotes in the file was the problem. As mentioned, adding the quoting parameter fixes the problem. I'd assume that if quotes are not paired, csv should raise an exception. And I don't think that all the different chunks of the file I tested, had always an even number of quotes. Also, I don't understand why using the default delimiter worked well, and with tab delimiter the problem happened. I'll take a look in more detail, but I'm closing this issue. Thank you all a lot for the help! -- resolution: -> not a bug status: open -> closed title: csv reader loosing rows with big files and tab delimiter -> csv reader losing rows with big files and tab delimiter ___ Python tracker <http://bugs.python.org/issue28642> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28642] csv reader losing rows with big files and tab delimiter
Marc Garcia added the comment: I could research a bit more on the problem. This is a minimal code that reproduces what happened: from io import StringIO import csv csv_file = StringIO('''1\t"A 2\tB''') reader = csv.reader(csv_file, delimiter='\t') for i, row in enumerate(reader): pass print(reader.line_num) # 2 print(i + 1)# 1 The reason to return the right number of rows with the default delimiter, is because the quote needs to be immediately after the delimiter to be considered the opening of a quoted text. If the file contains an opening quote, and the EOF is reached without its closing quote, the reader considers all the text until EOF to be that field. This would work as expected in a line like: 1,"well quoted text","this one has a missing quote But it'd fail silently with unexpected results in all other cases. I'd expect csv to raise an exception, more than the current behavior. Do you agree? Should I create another issue to address this? -- ___ Python tracker <http://bugs.python.org/issue28642> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28642] csv reader losing rows with big files and tab delimiter
Marc Garcia added the comment: I agree that for my case, I was using the wrong quoting parameter, and if I specify that my file has no quotes, it works as expected. But I still think that in a different case, when a file do have quotes, but they are not paired, it'd be better to raise an exception, than to ignore the error and assume there is just a missing quote at the end. >From the Zen of Python: "Errors should never pass silently", and I think it's >clear that there is an error in the file. -- ___ Python tracker <http://bugs.python.org/issue28642> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com