> * There seems to be some interaction between
> `parquet::internal::RecordReader` and `arrow::PoolBuffer` or
> `arrow::DefaultMemoryPool`. `RecordReader` requests an allocation to hold the
> entire column in memory without compression/encoding, even though Arrow
> supports dictionary encoding (and the column is dictionary encoded).

This is quite tricky. The Parquet format allows for dictionary
encoding as a data compression strategy, but it's not the same thing
as Arrow's dictionary encoding, where a common dictionary is shared
amongst one or more record batches. In Parquet, the dictionary will
likely change from row group to row group. So, in general, the only
reliably correct way to decode the Parquet file is to decode the
dictionary encoded values into dense / materialized form.
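
As a rough illustration of the Arrow side, here is a minimal pyarrow
sketch with made-up values (this is not the Parquet decode path itself,
just what Arrow-level dictionary encoding looks like):

    import pyarrow as pa

    # A dense / materialized string array, which is what the Parquet
    # reader currently produces after decoding.
    dense = pa.array(["apple", "banana", "apple", "banana", "apple"])

    # Arrow's dictionary encoding: integer indices plus a shared
    # dictionary. It is an in-memory representation, not a storage codec.
    encoded = dense.dictionary_encode()

    print(dense.type)    # string
    print(encoded.type)  # dictionary<values=string, indices=int32, ...>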

We have some JIRAs open about passing dictionary indices through to
Arrow without decoding them (decoding can cause memory use problems
when you have a lot of strings). This is doable, but it's quite a lot
of work because we must account for the case where the dictionary
changes when reading the next row group. We also cannot determine the
in-memory C++ Arrow schema from the Parquet metadata alone (since we
need to see the data to determine the dictionary).
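
To see the per-row-group nature of this from Python, something along
these lines should work (the file name is a placeholder; the encodings
are reported per column chunk, i.e. per row group):

    import pyarrow.parquet as pq

    # Placeholder file name; each row group carries its own dictionary
    # page, so encodings are reported per row-group column chunk.
    meta = pq.ParquetFile("data.parquet").metadata
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            print(rg, chunk.path_in_schema, chunk.encodings)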

- Wes

On Tue, May 29, 2018 at 9:01 AM, Bryant Menn <bryant.m...@gmail.com> wrote:
> Following up on what I have found with Uwe's advice and poking around the
> code base.
>
> * `columns=` helped, but only because it forced me to realize I did not need
> all of the columns at once every time. No particular column was
> significantly worse in memory usage.
> * There seems to be some interaction between
> `parquet::internal::RecordReader` and `arrow::PoolBuffer` or
> `arrow::DefaultMemoryPool`. `RecordReader` requests an allocation to hold
> the entire column in memory without compression/encoding, even though Arrow
> supports dictionary encoding (and the column is dictionary encoded).
>
> I imagine `RecordReader` requests enough memory to hold the data without
> encoding/compression for good reason (perhaps more robust assumptions about
> the underlying memory pool?), but is there a way to request only the memory
> required for dictionary encoding when it is an option?
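>
> For reference, one way to watch this from Python is something like the
> following sketch (the path and column names are placeholders, and note
> that `pa.total_allocated_bytes()` reports current rather than peak usage
> of the default memory pool, so it understates the spike):
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     # Placeholder file and columns; substitute the real ones.
>     before = pa.total_allocated_bytes()
>     table = pq.read_table("data.parquet", columns=["col_a", "col_b"])
>     after = pa.total_allocated_bytes()
>     # Current (not peak) allocation held by Arrow's default pool.
>     print("allocated bytes:", after - before)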
>
> My (incomplete) understanding comes from the surrounding lines here
> https://github.com/apache/parquet-cpp/blob/c405bf36506ec584e8009a6d53349277e600467d/src/parquet/arrow/record_reader.cc#L232
> .
>
> On Wed, Apr 25, 2018 at 2:23 PM Bryant Menn <bryant.m...@gmail.com> wrote:
>
>> Uwe,
>>
>> I'll try pinpointing things further with `columns=` and try to reproduce
>> what I find with data I can share.
>>
>> Thanks for the pointer.
>>
>> -Bryant
>>
>> On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>>> No, there is no need to pass any options on reading. Sometimes they are
>>> beneficial depending on what you want to achieve, but the defaults are OK, too.
>>>
>>> I'm not sure if you're able to post an example, but it would be nice if
>>> you could post the resulting Arrow schema from the table. It might be
>>> related to a specific type. A quick way to debug this on your side would
>>> also be to specify only a subset of columns to read using the `columns=`
>>> attribute on read_table. Maybe you can already pinpoint the memory problems
>>> to a specific column. Having these hints would make it easier for us to
>>> diagnose what the underlying problem is.
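>>>
>>> For example, something along these lines (the path and column name are
>>> placeholders):
>>>
>>>     import pyarrow.parquet as pq
>>>
>>>     # Placeholder path; print the resulting Arrow schema of the table.
>>>     table = pq.read_table("data.parquet")
>>>     print(table.schema)
>>>
>>>     # Then read suspect columns one at a time to pinpoint the memory use.
>>>     subset = pq.read_table("data.parquet", columns=["text_col"])
>>>     print(subset.schema)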
>>>
>>> Uwe
>>>
>>> On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
>>> > Uwe,
>>> >
>>> > I am not. Should I be? I forgot to mention earlier that the Parquet file
>>> > came from Spark/PySpark.
>>> >
>>> > On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>>> >
>>> > > Hello Bryant,
>>> > >
>>> > > are you using any options on `pyarrow.parquet.read_table` or a possible
>>> > > `to_pandas` afterwards?
>>> > >
>>> > > Uwe
>>> > >
>>> > > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
>>> > > > I tried reading a Parquet file (<200MB, lots of text with snappy) using
>>> > > > read_table and saw the memory usage peak over 8GB before settling back
>>> > > > down to ~200MB. This surprised me as I was expecting to be able to
>>> > > > handle a Parquet file of this size with much less RAM (doing some
>>> > > > processing with smaller VMs).
>>> > > >
>>> > > > I am not sure if this is expected, but I thought I might check with
>>> > > > everyone here and learn something new. Poking around, it seems to be
>>> > > > related to ParquetReader.read_all?
>>> > > >
>>> > > > Thanks in advance,
>>> > > > Bryant
>>> > >
>>>
>>
