No, there is no need to pass any options when reading. Sometimes they are beneficial depending on what you want to achieve, but the defaults are fine, too.
I'm not sure if you're able to post an example, but it would be nice if you could post the resulting Arrow schema of the table. The issue might be related to a specific type.

A quick way to debug this on your side would also be to specify only a subset of columns to read, using the `columns=` argument of read_table. Maybe you can already pinpoint the memory problem to a specific column. Having these hints would make it easier for us to diagnose the underlying problem. (A quick sketch of this approach follows the quoted thread below.)

Uwe

On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
> Uwe,
>
> I am not. Should I be? I forgot to mention earlier that the Parquet file
> came from Spark/PySpark.
>
> On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > Hello Bryant,
> >
> > are you using any options on `pyarrow.parquet.read_table` or a possible
> > `to_pandas` afterwards?
> >
> > Uwe
> >
> > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
> > > I tried reading a Parquet file (<200MB, lots of text with snappy) using
> > > read_table and saw the memory usage peak over 8GB before settling back
> > > down to ~200MB. This surprised me, as I was expecting to be able to
> > > handle a Parquet file of this size with much less RAM (doing some
> > > processing with smaller VMs).
> > >
> > > I am not sure if this is expected, but I thought I might check with
> > > everyone here and learn something new. Poking around, it seems to be
> > > related to ParquetReader.read_all?
> > >
> > > Thanks in advance,
> > > Bryant
> >
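For reference, a minimal sketch of the debugging approach suggested above: inspect the schema without materializing the data, then read only a subset of columns via `columns=` to narrow the memory spike down to a specific column. The file path and column name below are placeholders, not taken from the thread.

import pyarrow.parquet as pq

# Open the file lazily; this reads only metadata, not the data pages.
pf = pq.ParquetFile("data.parquet")  # hypothetical path
print(pf.schema)  # Parquet-level schema; the column types may already hint at the culprit

# Read just one column (hypothetical name) instead of the whole table.
# Repeating this per column shows which one drives the peak memory usage.
table = pq.read_table("data.parquet", columns=["some_text_column"])
print(table.schema)    # Arrow schema of the selected columns
print(table.num_rows)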