Bogdan Klichuk created ARROW-5791:
-------------------------------------

             Summary: pyarrow.csv.read_csv hangs + eats all RAM
                 Key: ARROW-5791
                 URL: https://issues.apache.org/jira/browse/ARROW-5791
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.13.0
         Environment: Ubuntu Xenial, python 2.7
            Reporter: Bogdan Klichuk
         Attachments: csvtest.py, graph.svg, sample_32768_cols.csv, 
sample_32769_cols.csv

I have a fairly sparse dataset in CSV format: a wide table with only a few rows 
but many (about 32k) columns. Total size is ~540K.

When I read the dataset using `pyarrow.csv.read_csv`, the call hangs, gradually 
consumes all available memory, and the process gets killed.

More details on the conditions follow. The script to run and all files 
mentioned are attached.

1) `sample_32769_cols.csv` is the dataset that triggers the problem.

2) `sample_32768_cols.csv` is the dataset that DOES NOT trigger it and is read 
in under 400ms on my machine. It is the same dataset minus the ONE last column. 
That last column is no different from the others and contains only empty values.

I don't know why exactly this one column makes the difference between normal 
execution and a hang that looks like a memory leak.

I have created a flame graph for case (1) to help with resolving this issue 
(`graph.svg`).
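
For readers without access to the attachments, here is a minimal sketch of how 
comparable wide, sparse files could be generated and timed. It is NOT the 
attached `csvtest.py`; the helper names `write_wide_csv` and `time_read`, the 
column naming, and the fill pattern are all illustrative assumptions, and the 
sketch assumes Python 3 rather than the Python 2.7 environment listed above.

```python
import csv
import time


def write_wide_csv(path, n_cols, n_rows=5):
    """Write a sparse CSV with n_cols columns; most cells are empty.

    Hypothetical helper: the layout only approximates the attached samples.
    """
    header = ["col_%d" % i for i in range(n_cols)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for r in range(n_rows):
            # Fill roughly one cell in a thousand; leave the rest empty.
            row = [str(r) if (r + c) % 1000 == 0 else "" for c in range(n_cols)]
            writer.writerow(row)


def time_read(path):
    """Return (elapsed_seconds, num_columns) for pyarrow.csv.read_csv."""
    # Imported lazily so the generator above works without pyarrow installed.
    from pyarrow import csv as pa_csv

    start = time.time()
    table = pa_csv.read_csv(path)
    return time.time() - start, table.num_columns
```

Calling `write_wide_csv` with 32768 and then 32769 columns and passing each 
file to `time_read` would mirror the comparison between cases (2) and (1). 
Whether synthetic files like these trigger the same hang has not been 
confirmed; the attached samples remain the reference reproduction.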

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
