Following up with what I have found after taking Uwe's advice and poking around the code base.
* `columns=` helped, but only because it forced me to realize I did not need all of the columns at once every time. No particular column was significantly worse in memory usage (see the sketch at the bottom of this message, below the quoted thread).
* There seems to be some interaction between `parquet::internal::RecordReader` and `arrow::PoolBuffer` or `arrow::DefaultMemoryPool`. `RecordReader` requests an allocation large enough to hold the entire column in memory without compression/encoding, even though Arrow supports dictionary encoding (and the column is dictionary encoded). I imagine `RecordReader` requests enough memory to hold the data without encoding/compression for good reason (perhaps more robust assumptions about the underlying memory pool?), but is there a way to request only the memory required for dictionary encoding when it is an option? My (incomplete) understanding comes from the surrounding lines here:
  https://github.com/apache/parquet-cpp/blob/c405bf36506ec584e8009a6d53349277e600467d/src/parquet/arrow/record_reader.cc#L232

On Wed, Apr 25, 2018 at 2:23 PM Bryant Menn <bryant.m...@gmail.com> wrote:

> Uwe,
>
> I'll try pinpointing things further with `columns=` and try to reproduce
> what I find with data I can share.
>
> Thanks for the pointer.
>
> -Bryant
>
> On Wed, Apr 25, 2018 at 2:10 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>
>> No, there is no need to pass any options on reading. Sometimes they are
>> beneficial depending on what you want to achieve, but defaults are OK, too.
>>
>> I'm not sure if you're able to post an example, but it would be nice if
>> you could post the resulting Arrow schema from the table. It might be
>> related to a specific type. A quick way to debug this on your side would
>> also be to specify only a subset of columns to read using the `columns=`
>> attribute on read_table. Maybe you can already pinpoint the memory problems
>> to a specific column. Having these hints would make it easier for us to
>> diagnose what the underlying problem is.
>>
>> Uwe
>>
>> On Wed, Apr 25, 2018, at 8:06 PM, Bryant Menn wrote:
>> > Uwe,
>> >
>> > I am not. Should I be? I forgot to mention earlier that the Parquet file
>> > came from Spark/PySpark.
>> >
>> > On Wed, Apr 25, 2018 at 1:32 PM Uwe L. Korn <uw...@xhochy.com> wrote:
>> >
>> > > Hello Bryant,
>> > >
>> > > are you using any options on `pyarrow.parquet.read_table` or a possible
>> > > `to_pandas` afterwards?
>> > >
>> > > Uwe
>> > >
>> > > On Wed, Apr 25, 2018, at 7:27 PM, Bryant Menn wrote:
>> > > > I tried reading a Parquet file (<200MB, lots of text with snappy)
>> > > > using read_table and saw the memory usage peak over 8GB before
>> > > > settling back down to ~200MB. This surprised me as I was expecting
>> > > > to be able to handle a Parquet file of this size with much less RAM
>> > > > (doing some processing with smaller VMs).
>> > > >
>> > > > I am not sure if this is expected, but I thought I might check with
>> > > > everyone here and learn something new. Poking around, it seems to be
>> > > > related to ParquetReader.read_all?
>> > > >
>> > > > Thanks in advance,
>> > > > Bryant
>> > >
>>
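For reference, here is a minimal sketch of the `columns=` / schema-inspection approach discussed above. The file path and column names are placeholders, not from the actual data set:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder path; substitute the actual Parquet file.
path = "data.parquet"

# Inspect the schema and file metadata without materializing any data.
pf = pq.ParquetFile(path)
print(pf.schema.to_arrow_schema())  # the Arrow schema Uwe asked about
print(pf.metadata)                  # row groups, encodings, compressed sizes

# Read only the columns that are actually needed.
table = pq.read_table(path, columns=["col_a", "col_b"])
print(table.schema)

# Bytes currently held by Arrow's default memory pool after the read.
print(pa.total_allocated_bytes())
```

On a dictionary-encoded text column, the allocated bytes after the read tend to be far larger than the compressed file size, which is consistent with the `RecordReader` behaviour described above.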