Hi all, I found my issue: I was not actually passing down the ArrowReaderProperties. I can now see that lowering batch_size meaningfully reduces memory usage [1]. I still see higher memory usage when reading files with larger row groups, even with the batch size held constant.
Overall, I found that users who want to keep memory usage down when reading Parquet should: (1) turn off prebuffering, (2) read data in batches, and (3) turn on buffered_stream. If there's no further input, I may add these suggestions to our docs.

[1] https://github.com/wjones127/arrow-parquet-memory-bench/blob/7e0d740a09c8042da647a0de1f285b6bb8a7f4db/readme_files/figure-gfm/group-size-1.png

On Tue, Aug 9, 2022 at 4:11 PM Will Jones <will.jones...@gmail.com> wrote:

> I did some experiments to try to understand what controls a user has to
> constrain how much memory our Parquet readers use, at least as measured by
> the memory pool's max_memory() method.
>
> I was surprised to find that parquet::ArrowReaderProperties.batch_size
> didn't have much of an effect at all on the peak memory usage [1]. The code
> I ran was [2].
>
> Two questions:
>
> 1. Is this expected? Or does it sound like I did something wrong?
> 2. Is there a way we could make it so that setting a smaller batch size
> reduced the memory required to read into a record batch stream?
>
> I created a repo for these tests at [3].
>
> [1]
> https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/readme_files/figure-gfm/group-size-1.png
> [2]
> https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/src/main.cc#L51-L66
> [3] https://github.com/wjones127/arrow-parquet-memory-bench
>