[issue28642] csv reader loosing rows with big files and tab delimiter

2016-11-08 Thread Marc Garcia

New submission from Marc Garcia:

I'm using the csv module from the Python standard library to read a 1.4 GB
file with 11,157,064 rows. The file is the Geonames dataset for all countries,
which can be freely downloaded [1].

I'm using this code to read it:

import csv

with open('allCountries.txt', 'r') as fd:
    reader = csv.reader(fd, delimiter='\t')
    for i, row in enumerate(reader):
        pass

print(i + 1)  # prints 10381963
print(reader.line_num)  # prints 11157064

For some reason, around 7% of the rows in the file are skipped. The rows
don't have anything special (most of them are all ASCII characters, even
though the file is in UTF-8).

If I create a new file with all the skipped rows and read it again in the
same way, around 30% of the rows are skipped. So many of them weren't returned
by the iterator when they were part of the bigger file, but now they are.

Note that the attribute line_num has the right number. Also note that if I 
remove the delimiter parameter (tab) from the reader, and it uses the default 
comma, the iteration on the reader doesn't skip any row.

I checked what I think is the relevant part of the code [2], but I couldn't
see anything that could cause this bug.


1. http://download.geonames.org/export/dump/allCountries.zip
2. https://hg.python.org/cpython/file/tip/Modules/_csv.c#l787

--
components: Library (Lib)
messages: 280323
nosy: datapythonista
priority: normal
severity: normal
status: open
title: csv reader loosing rows with big files and tab delimiter
versions: Python 3.5

___
Python tracker 
<http://bugs.python.org/issue28642>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue28642] csv reader losing rows with big files and tab delimiter

2016-11-08 Thread Marc Garcia

Marc Garcia added the comment:

Sorry, my fault. It looks like having quotes in the file was the problem. As 
mentioned, adding the quoting parameter fixes the problem.
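
A minimal sketch of that fix, assuming the quoting parameter in question is
csv.QUOTE_NONE (the thread does not spell out the value) and using a small
hypothetical in-memory sample in place of the real file:

```python
import csv
from io import StringIO

# Hypothetical sample: tab-separated data with a stray, unpaired quote.
sample = StringIO('1\t"A\n2\tB\n')

# QUOTE_NONE makes the reader treat the quote character as ordinary data,
# so it can no longer open a quoted field that spans several lines.
reader = csv.reader(sample, delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)

print(len(rows))  # 2
print(rows[0])    # ['1', '"A']
```

With this setting the quote stays in the field value, which is only what you
want if the file genuinely has no quoted fields.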

I'd assume that if quotes are not paired, csv should raise an exception. And I
don't think that all the different chunks of the file I tested always had an
even number of quotes.

Also, I don't understand why using the default delimiter worked well, while
with the tab delimiter the problem happened.

I'll take a look in more detail, but I'm closing this issue.

Thank you all a lot for the help!

--
resolution:  -> not a bug
status: open -> closed
title: csv reader loosing rows with big files and tab delimiter -> csv reader 
losing rows with big files and tab delimiter




[issue28642] csv reader losing rows with big files and tab delimiter

2016-11-09 Thread Marc Garcia

Marc Garcia added the comment:

I was able to look into the problem a bit more. This is a minimal example that
reproduces what happened:

from io import StringIO
import csv

csv_file = StringIO('''1\t"A
2\tB''')

reader = csv.reader(csv_file, delimiter='\t')
for i, row in enumerate(reader):
    pass

print(reader.line_num)  # 2
print(i + 1)  # 1

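The default delimiter returns the right number of rows because a quote is only
treated as the opening of a quoted field when it comes immediately after the
delimiter (or at the start of a line). With a comma as the delimiter, the quote
after the tab is in the middle of a field, so it is read as literal text.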
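
A small sketch comparing the two delimiters on the same data (the variable
names are only illustrative):

```python
import csv
from io import StringIO

data = '1\t"A\n2\tB'

# Default comma delimiter: the quote sits mid-field, so it is literal text
# and each line becomes its own single-field row.
comma_rows = list(csv.reader(StringIO(data)))

# Tab delimiter: the quote is the first character of a field, so it opens
# a quoted field that swallows the newline and the whole second line.
tab_rows = list(csv.reader(StringIO(data), delimiter='\t'))

print(comma_rows)  # [['1\t"A'], ['2\tB']]
print(tab_rows)    # [['1', 'A\n2\tB']]
```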

If the file contains an opening quote, and the EOF is reached without its 
closing quote, the reader considers all the text until EOF to be that field.

This would work as expected in a line like:

1,"well quoted text","this one has a missing quote

But it would fail silently, with unexpected results, in all other cases. I'd
expect csv to raise an exception rather than behave as it does now.

Do you agree? Should I create another issue to address this?




[issue28642] csv reader losing rows with big files and tab delimiter

2016-11-09 Thread Marc Garcia

Marc Garcia added the comment:

I agree that for my case, I was using the wrong quoting parameter, and if I 
specify that my file has no quotes, it works as expected.

But I still think that in a different case, when a file does have quotes but
they are not paired, it would be better to raise an exception than to ignore
the error and assume there is just a missing quote at the end.
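
For what it's worth, the csv module has a `strict` dialect parameter that is
documented as raising csv.Error on bad CSV input. A minimal sketch, assuming
it covers the unterminated-quote case from the reproduction above (it may not
catch every kind of malformed quoting):

```python
import csv
from io import StringIO

data = '1\t"A\n2\tB'  # the opening quote is never closed

# strict=True asks the reader to raise csv.Error on bad input
# instead of silently accepting it.
reader = csv.reader(StringIO(data), delimiter='\t', strict=True)

try:
    rows = list(reader)
    error = None
except csv.Error as exc:
    rows = None
    error = exc

print(type(error).__name__, error)
```

Since `strict` defaults to False, the silent behavior described here is what
you get unless the caller opts in.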

From the Zen of Python: "Errors should never pass silently", and I think it's
clear that there is an error in the file.
