Re: Peak memory usage for pyarrow.parquet.read_table

2018-06-10 Thread Wes McKinney
> * There seems to be some interaction between `parquet::internal::RecordReader` and `arrow::PoolBuffer` or `arrow::DefaultMemoryPool`. `RecordReader` requests an allocation to hold the entire column in memory without compression/encoding even though Arrow supports dictionary encoding (a
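For readers following along, the allocation behaviour being discussed can be observed from Python through Arrow's default memory pool. This is a minimal sketch under that assumption, not code from the thread; the file path is a placeholder, and `total_allocated_bytes()` reports what the pool currently holds after the read, not the transient peak inside it.

```python
import pyarrow as pa
import pyarrow.parquet as pq

before = pa.total_allocated_bytes()
table = pq.read_table("example.parquet")  # placeholder path
after = pa.total_allocated_bytes()

# Bytes retained by Arrow's default memory pool for the decoded table;
# for heavily compressed/encoded text columns this can be far larger
# than the Parquet file on disk.
print("retained by default pool:", after - before)
```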

Re: Peak memory usage for pyarrow.parquet.read_table

2018-05-29 Thread Bryant Menn
Following up on what I have found with Uwe's advice and poking around the code base. * `columns=` helped, but only because it forced me to realize I did not need all of the columns at once every time; no particular column was significantly worse in memory usage. * There seems to be some interaction
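For context, the `columns=` usage referred to above looks roughly like this; a minimal sketch with a placeholder path and placeholder column names, not Bryant's actual code.

```python
import pyarrow.parquet as pq

# Only materialize the columns that are needed for this step.
table = pq.read_table("example.parquet", columns=["id", "text_col"])
print(table.schema)
```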

Re: Peak memory usage for pyarrow.parquet.read_table

2018-04-25 Thread Bryant Menn
Uwe, I'll try pinpointing things further with `columns=` and try to reproduce what I find with data I can share. Thanks for the pointer. -Bryant On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn wrote: > No, there is no need to pass any options on reading. Sometimes they are beneficial depending
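One way to do the pinpointing described above is to read the columns one at a time and compare how much memory each leaves allocated; a hedged sketch assuming a placeholder path, not code from the thread.

```python
import pyarrow as pa
import pyarrow.parquet as pq

path = "example.parquet"  # placeholder path

# Read each column on its own and report what Arrow's default pool holds,
# to narrow down which column is expensive to decode.
for name in pq.read_schema(path).names:
    table = pq.read_table(path, columns=[name])
    print(name, pa.total_allocated_bytes(), "bytes allocated")
    del table
```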

Re: Peak memory usage for pyarrow.parquet.read_table

2018-04-25 Thread Uwe L. Korn
No, there is no need to pass any options on reading. Sometimes they are beneficial depending on what you want to achieve, but the defaults are fine, too. I'm not sure if you're able to post an example, but it would be nice if you could post the resulting Arrow schema from the table. It might be related
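Dumping the requested schema is a one-liner; a minimal sketch with a placeholder path, not code from the thread. Printing the Parquet file metadata alongside it can also help, since it shows row groups, encodings, and compression.

```python
import pyarrow.parquet as pq

path = "example.parquet"  # placeholder path

# The Arrow schema of the table produced by read_table ...
table = pq.read_table(path)
print(table.schema)

# ... and the Parquet file metadata (row groups, encodings, compression).
print(pq.ParquetFile(path).metadata)
```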

Re: Peak memory usage for pyarrow.parquet.read_table

2018-04-25 Thread Bryant Menn
Uwe, I am not. Should I be? I forgot to mention earlier that the Parquet file came from Spark/PySpark. On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn wrote: > Hello Bryant, are you using any options on `pyarrow.parquet.read_table` or a possible `to_pandas` afterwards? > Uwe > On Wed, Apr
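For reference, a file like this is typically produced by a Spark job along these lines; a hypothetical sketch, not Bryant's actual pipeline. Spark writes snappy-compressed Parquet by default.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-parquet").getOrCreate()

# Placeholder data and output path; snappy compression is Spark's default.
df = spark.createDataFrame(
    [(1, "some long text"), (2, "more text")],
    ["id", "text_col"],
)
df.write.mode("overwrite").parquet("example_output.parquet")
```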

Re: Peak memory usage for pyarrow.parquet.read_table

2018-04-25 Thread Uwe L. Korn
Hello Bryant, are you using any options on `pyarrow.parquet.read_table` or a possible `to_pandas` call afterwards? Uwe On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote: > I tried reading a Parquet file (<200MB, lots of text with snappy) using `read_table` and saw the memory usage peak over 8GB be
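The no-options usage Uwe is asking about would look like this; a minimal sketch with a placeholder path, not code from the thread.

```python
import pyarrow.parquet as pq

# Defaults only: no column selection, no read options, straight to pandas.
table = pq.read_table("example.parquet")
df = table.to_pandas()
```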

Peak memory usage for pyarrow.parquet.read_table

2018-04-25 Thread Bryant Menn
I tried reading a Parquet file (<200MB, lots of text with snappy) using `read_table` and saw the memory usage peak over 8GB before settling back down to ~200MB. This surprised me as I was expecting to be able to handle a Parquet file of this size with much less RAM (doing some processing with smaller
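A rough way to reproduce this observation is to check the process's peak resident set size around the call; a minimal sketch, not code from the thread. The file path is a placeholder, and `ru_maxrss` is reported in kilobytes on Linux but bytes on macOS.

```python
import resource

import pyarrow.parquet as pq

table = pq.read_table("example.parquet")  # placeholder path

# Peak resident set size of this process so far, which captures the
# transient spike during the read even after memory settles back down.
peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS observed so far:", peak_rss)
```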