Following up with what I have found after taking Uwe's advice and poking
around the code base.

* `columns=` helped, but mostly because it forced me to realize I did not
need all of the columns at once every time; see the sketch after this list
for how I am reading subsets now. No particular column was significantly
worse in memory usage than the others.
* There seems to be some interaction between
`parquet::internal::RecordReader` and `arrow::PoolBuffer` or
`arrow::DefaultMemoryPool`. `RecordReader` requests an allocation to hold
the entire column in memory without compression/encoding even though Arrow
supports dictionary encoding (and the column is dictionary encoded).
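
For reference, a rough sketch of how I am reading column subsets now (the
file path and column names below are just placeholders):

    import pyarrow.parquet as pq

    # Read only the columns needed for the current step; each call
    # materializes just those columns instead of the whole file.
    table = pq.read_table('data.parquet', columns=['col_a', 'col_b'])
    df = table.to_pandas()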

I imagine `RecordReader` requests enough memory to hold the data without
encoding/compression for good reason (perhaps more robust assumptions about
the underlying memory pool?), but is there a way to request only the memory
required for dictionary encoding when it is an option?

My (incomplete) understanding comes from the lines around
https://github.com/apache/parquet-cpp/blob/c405bf36506ec584e8009a6d53349277e600467d/src/parquet/arrow/record_reader.cc#L232
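
In case it is useful for reproducing, this is roughly how I have been
watching the allocation spike from Python (assuming `max_memory()` on the
default pool reports the high-water mark, and with a placeholder path and
column name):

    import pyarrow as pa
    import pyarrow.parquet as pq

    pool = pa.default_memory_pool()
    table = pq.read_table('data.parquet', columns=['text_col'])

    # Peak bytes Arrow allocated during the read vs. what is still held after.
    print('peak:', pool.max_memory())
    print('still held:', pa.total_allocated_bytes())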

On Wed, Apr 25, 2018 at 2:23 PM Bryant Menn <bryant.m...@gmail.com> wrote:

> Uwe,
>
> I'll try pinpointing things further with `columns=` and try to reproduce
> what I find with data I can share.
>
> Thanks for the pointer.
>
> -Bryant
>
> On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>
>> No, there is no need to pass any options on reading. Sometimes they are
>> beneficial depending on what you want to achieve but defaults are ok, too.
>>
>> I'm not sure if you're able to post an example but it would be nice if
>> you could post the resulting Arrow schema from the table. It might be
>> related to a specific type. A quick way to debug this on your side would
>> also be to specify only a subset of columns to read using the `columns=`
>> attribute on read_table. Maybe you can already pinpoint the memory problems
>> to a specific column. Having these hints would make it easier for us to
>> diagnose what the underlying problem is.
>>
>> Uwe
>>
>> On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
>> > Uwe,
>> >
>> > I am not. Should I be? I forgot to mention earlier that the Parquet file
>> > came from Spark/PySpark.
>> >
>> > On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>> >
>> > > Hello Bryant,
>> > >
>> > > are you using any options on `pyarrow.parquet.read_table` or a
>> possible
>> > > `to_pandas` afterwards?
>> > >
>> > > Uwe
>> > >
>> > > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
>> > > > I tried reading a Parquet file (<200MB, lots of text with snappy)
>> using
>> > > > read_table and saw the memory usage peak over 8GB before settling
>> back
>> > > down
>> > > > to ~200MB. This surprised me as I was expecting to be able to
>> handle a
>> > > > Parquet file of this size with much less RAM (doing some processing
>> with
>> > > > smaller VMs).
>> > > >
>> > > > I am not sure if this expected, but I thought I might check with
>> everyone
>> > > > here and learn something new. Poking around it seems to be related
>> with
>> > > > ParquetReader.read_all?
>> > > >
>> > > > Thanks in advance,
>> > > > Bryant
>> > >
>>
>
