[issue18219] csv.DictWriter is slow when writing files with large number of columns
New submission from Mikhail Traskin:

The _dict_to_list method of csv.DictWriter objects created with extrasaction="raise" checks each row for unknown fields by looking every key up in the list of field names. Each look-up is linear in the number of fields, so the check takes O(n^2) time per row, which is very slow when a CSV file has hundreds or thousands of columns. Replacing the look-up in a list with a look-up in a set solves the issue (see the attached patch).

--
components: Library (Lib)
files: csvdictwriter.patch
keywords: patch
messages: 191197
nosy: mtraskin
priority: normal
severity: normal
status: open
title: csv.DictWriter is slow when writing files with large number of columns
type: performance

Added file: http://bugs.python.org/file30598/csvdictwriter.patch

___
Python tracker <http://bugs.python.org/issue18219>
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
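The difference between the two look-ups can be sketched as follows. This is a simplified stand-in for the unknown-field check, not the actual csv module code; the function names wrong_fields_list and wrong_fields_set are hypothetical and chosen only for illustration.

```python
def wrong_fields_list(rowdict, fieldnames):
    # "k not in fieldnames" scans the list, so each test costs O(n)
    # and checking a full row of n keys costs O(n^2).
    return [k for k in rowdict if k not in fieldnames]


def wrong_fields_set(rowdict, fieldnames):
    # Building the set costs O(n) once; each membership test is then
    # O(1) on average, so the whole check is O(n).
    fieldset = set(fieldnames)
    return [k for k in rowdict if k not in fieldset]


# A wide row: 2000 known columns plus one unknown key.
fieldnames = ["col%d" % i for i in range(2000)]
row = dict.fromkeys(fieldnames, 0)
row["unexpected"] = 1
```

Both functions return the same list of unknown keys; only the cost of each membership test differs.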
[issue18219] csv.DictWriter is slow when writing files with large number of columns
Mikhail Traskin added the comment:

Any way is fine with me. If you prefer to avoid having a public fieldset property, please use the attached patch.

--
Added file: http://bugs.python.org/file30605/csvdictwriter.v2.patch
[issue18219] csv.DictWriter is slow when writing files with large number of columns
Mikhail Traskin added the comment:

> What is the purpose in touching fieldnames [...]

Wrapping the fieldnames property and tupleizing it guarantees that the fieldnames and _fieldset fields stay consistent. Otherwise, with a separate _fieldset field, someone who modifies fieldnames will not modify _fieldset, and DictWriter will behave inconsistently. Normal DictWriter users (those who do not modify fieldnames after the DictWriter is created) will not notice this wrapper. "Non-normal" DictWriter users will have their code broken, but that is better than inconsistent internal data structures, since such errors are very hard to detect. If you insist on keeping the interface intact, then use the attached v3 of the patch: it creates a fieldset object every time the _dict_to_list method is executed. This does slow execution down, but performance is acceptable: about 1.5 times slower than the version with the _fieldset field.

> wrong_fields could be calculated with [...]

I believe it is better to report all wrong fields at once. In addition, this optimization is meaningless: unless something is wrong, the field check usually requires a full scan of the rowdict anyway.

> That said, in 3.x, replacing [...]

In 2.x the list comprehension version is faster than the set difference version. In 3.x the set difference is slightly faster (maybe 10% faster). However, the list comprehension works in both 2.x and 3.x, while the set difference requires different code for each. Hence I prefer sticking with the list comprehension.

> Does test/text_cvs have tests [...]

No, there are no tests for wrong fields. Correct fields are already checked by the standard writing tests. I do not know how to write tests for exception handling. If you provide a link with instructions, I can write the missing test part.

> Also, if you have not done so yet, please go to [...]

I have already done this.
--
Added file: http://bugs.python.org/file31297/csvdictwriter.v3.patch
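On the question of testing exception handling: one common approach is unittest's assertRaises used as a context manager. The sketch below is illustrative only and is not taken from any of the attached patches; the class and method names are hypothetical. It relies on the default extrasaction="raise", under which writerow raises ValueError for keys not in fieldnames.

```python
import csv
import io
import unittest


class WrongFieldsTest(unittest.TestCase):  # hypothetical test class name
    def test_unknown_field_raises(self):
        # extrasaction defaults to "raise", so an unknown key is an error.
        writer = csv.DictWriter(io.StringIO(), fieldnames=["a", "b"])
        with self.assertRaises(ValueError):
            writer.writerow({"a": 1, "b": 2, "c": 3})

    def test_known_fields_are_written(self):
        # A row containing only known keys must not raise.
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=["a", "b"])
        writer.writerow({"a": 1, "b": 2})
        self.assertEqual(out.getvalue().strip(), "1,2")
```

The tests can be run with python -m unittest against the file that contains them.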
[issue18219] csv.DictWriter is slow when writing files with large number of columns
Mikhail Traskin added the comment:

Peter, thank you for letting me know that views work with lists; I was not aware of this. This is indeed the best solution, and it also keeps the DictWriter interface unchanged. Terry, the attached patch contains the DictWriter change and a test case in test_csv.py.

--
Added file: http://bugs.python.org/file31570/csvdictwriter.v4.patch
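Assuming "views" here refers to dictionary key views, the idea can be sketched as below. In Python 3, dict.keys() returns a set-like view whose difference operator accepts any iterable, including a plain list, so no separate set of field names has to be stored. The function name find_wrong_fields is hypothetical, and this is a sketch of the technique rather than the contents of the v4 patch.

```python
def find_wrong_fields(rowdict, fieldnames):
    # Python 3 only: keys views support set operations, and the
    # right-hand operand may be any iterable, such as a list.
    return rowdict.keys() - fieldnames


fieldnames = ["id", "name"]  # a plain list, left unchanged
rowdict = {"id": 1, "name": "x", "extra": 2}
wrong_fields = find_wrong_fields(rowdict, fieldnames)
```

Because fieldnames stays an ordinary list, code that modifies it after the writer is created keeps working, which matches the point above about leaving the interface unchanged.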