Daniel Haviv created ARROW-5993:
-----------------------------------

             Summary: Reading a dictionary column results in disproportionate memory usage
                 Key: ARROW-5993
                 URL: https://issues.apache.org/jira/browse/ARROW-5993
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.0
            Reporter: Daniel Haviv

I'm using pyarrow to read a 40 MB Parquet file. Reading every column except the "body" column makes the process peak at 170 MB. Reading only the "body" column results in over 6 GB of memory used.

I made the file publicly accessible: s3://dhavivresearch/pyarrow/demofile.parquet
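
For anyone trying to reproduce this, the comparison can be sketched roughly as follows. This is only a sketch under assumptions: it expects a local copy of the file, the helper names (`columns_except`, `arrow_bytes_after_read`, `compare_reads`) are made up for illustration, and it reports what Arrow's default memory pool still holds after each read rather than the peak memory of the process, so the numbers will not match the 170 MB and 6 GB figures exactly.

```python
def columns_except(all_columns, excluded):
    """Return the column names not in `excluded`, keeping their order."""
    excluded = set(excluded)
    return [name for name in all_columns if name not in excluded]


def arrow_bytes_after_read(path, columns):
    """Read `columns` from a Parquet file and return how much Arrow's
    default memory pool grew while the resulting table is still alive."""
    # Imported lazily so the pure helper above works without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    before = pa.total_allocated_bytes()
    table = pq.read_table(path, columns=columns)
    after = pa.total_allocated_bytes()  # `table` is still referenced here
    return after - before


def compare_reads(path, heavy_column="body"):
    """Compare reading everything except `heavy_column` with reading only it.

    `heavy_column` defaults to "body", the column named in this report.
    """
    import pyarrow.parquet as pq

    all_columns = pq.read_schema(path).names  # top-level column names
    others = columns_except(all_columns, [heavy_column])
    return {
        "all except " + heavy_column: arrow_bytes_after_read(path, others),
        "only " + heavy_column: arrow_bytes_after_read(path, [heavy_column]),
    }
```

Calling `compare_reads("demofile.parquet")` on a downloaded copy returns the two byte counts side by side; the local path here is an assumption, not something from the report.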