[issue43625] CSV has_headers heuristic could be improved

Skip Montanaro Thu, 25 Mar 2021 14:04:49 -0700


Skip Montanaro <skip.montan...@gmail.com> added the comment:


I assume the OP is referring to this sort of usage:

>>> sniffer = csv.Sniffer()
>>> raw = open("mixed.csv").read()
>>> sniffer.has_header(raw)
False

*sigh*

I really wish the Sniffer class had never been added to the CSV module. I can't 
recall who wrote it (the author is long gone). Though I am responsible for the 
initial commits, it wasn't me or the main authors of csvmodule.c. As far as I 
know, it never really worked well. I can't recall ever using it.

A simpler heuristic would be if the first row contains a bunch of strings and 
the second row contains a bunch of numbers, then the file has a header. That 
assumes that CSV files consist mostly of numeric data.

Looking at has_header, I see this:

    for thisType in [int, float, complex]:

I think this particular problem would be solved if the order of those types 
were reversed. The attached diff suggests that as well. Note that the Sniffer 
class currently contains no test cases, so that the test I added failed before 
the change and passes after doesn't mean it doesn't break someone's mission 
critical Sniffer usage.

(Sorry, Raymond. My Github-foo is insufficient to allow me to fork, apply the 
diff and create a PR.)

----------
keywords: +patch
Added file: https://bugs.python.org/file49915/csv.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43625>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue43625] CSV has_headers heuristic could be improved

Reply via email to