Thanks for the input, Weston!
> it might be worth just inspecting the RSS usage of your benchmarks.

Had initially planned that, but I was working on macOS and didn't have the
patience to download Xcode to get access to Instruments. I'll run it on
Linux instead, which should make it easier to measure.

> we should recommend users also measure the performance impact of these
> changes.

Yes, turning these settings on may reduce memory usage at the cost of being
slower; that's why they are not the default. How much slower likely depends
on the filesystem and the contents of the Parquet file. And any of these
changes could have unexpected interactions with other parts of your
application.

> If the goal is to reduce memory then users might want to also think about
> dictionary encoding for string/binary columns.

I hadn't included that in the scope of my tests, but it's a good point. I've
had the thought for a while of writing a blog post about how to represent
data efficiently in Arrow, taking advantage of the various types. An
interesting extension would be thinking about how that interacts with
Parquet.
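For concreteness, below is roughly the kind of snippet I'd have in mind for
the docs. It's only a sketch of the C++ reader properties discussed above;
the buffer size, batch size, and the dictionary column index are
illustrative values, and the exact calls may differ a bit between Arrow
versions.

#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Read a Parquet file with the memory-conscious settings discussed above.
arrow::Status ReadLowMemory(const std::string& path) {
  // (3) Buffered stream: stream each column chunk through a small read
  // buffer instead of loading the whole chunk at once.
  parquet::ReaderProperties reader_props(arrow::default_memory_pool());
  reader_props.enable_buffered_stream();
  reader_props.set_buffer_size(1 << 20);  // 1 MiB (illustrative)

  parquet::ArrowReaderProperties arrow_props;
  // (1) Turn off prebuffering of column chunk byte ranges.
  arrow_props.set_pre_buffer(false);
  // (2) Read in batches: cap the number of rows decoded per record batch.
  arrow_props.set_batch_size(64 * 1024);  // illustrative
  // Optional: read column 0 (say, a string column) as dictionary-encoded.
  arrow_props.set_read_dictionary(0, true);

  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(input, reader_props));
  builder.memory_pool(arrow::default_memory_pool());
  builder.properties(arrow_props);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  // Stream the file one record batch at a time instead of materializing
  // the whole table.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;
    // ... consume the batch ...
  }
  return arrow::Status::OK();
}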
On Thu, Aug 11, 2022 at 8:06 PM Weston Pace <weston.p...@gmail.com> wrote:

> Just a few additional thoughts:
>
> > at least as measured by
> > the memory pools max_memory() method.
>
> The parquet reader does a fair amount of allocation on the global
> system allocator (i.e. not using a memory pool). Typically this
> should be small in comparison with the data buffers themselves (which
> will be allocated on memory pools) but it might be worth just
> inspecting the RSS usage of your benchmarks.
>
> > (1) Turn off prebuffering, (2) Read data in batches, and
> > (3) turn on buffered_stream.
>
> This might go without saying but if we're going to include it in our
> docs we should recommend users also measure the performance impact of
> these changes.
>
> > If there's no further input
>
> If the goal is to reduce memory then users might want to also think
> about dictionary encoding for string/binary columns. I'm not entirely
> sure how the properties work but I think you can force Arrow to read
> certain columns as dictionary encoded (I could be very wrong here).
>
> On Thu, Aug 11, 2022 at 12:26 PM Will Jones <will.jones...@gmail.com>
> wrote:
> >
> > Hi all,
> >
> > I found my issue: I was not actually passing down the
> > ArrowReaderProperties. I can now see that lowering batch_size
> > meaningfully reduces memory usage [1]. I still see more memory used
> > when reading files with larger row groups, keeping the batch size
> > constant.
> >
> > Overall I found that users who want to keep memory usage down when
> > reading Parquet should: (1) Turn off prebuffering, (2) Read data in
> > batches, and (3) turn on buffered_stream.
> >
> > If there's no further input, I may add these suggestions to our docs.
> >
> > [1]
> > https://github.com/wjones127/arrow-parquet-memory-bench/blob/7e0d740a09c8042da647a0de1f285b6bb8a7f4db/readme_files/figure-gfm/group-size-1.png
> >
> > On Tue, Aug 9, 2022 at 4:11 PM Will Jones <will.jones...@gmail.com>
> > wrote:
> >
> > > I did some experiments to try to understand what controls a user has
> > > to constrain how much memory our Parquet readers use, at least as
> > > measured by the memory pools max_memory() method.
> > >
> > > I was surprised to find that parquet::ArrowReaderProperties.batch_size
> > > didn't have much of an effect at all on the peak memory usage [1].
> > > The code I ran was [2].
> > >
> > > Two questions:
> > >
> > > 1. Is this expected? Or does it sound like I did something wrong?
> > > 2. Is there a way we could make it so that setting a smaller batch
> > > size reduced the memory required to read into a record batch stream?
> > >
> > > I created a repo for these tests at [3].
> > >
> > > [1]
> > > https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/readme_files/figure-gfm/group-size-1.png
> > > [2]
> > > https://github.com/wjones127/arrow-parquet-memory-bench/blob/5434f9f642c452470aa18ca872e9acd0d7462a1a/src/main.cc#L51-L66
> > > [3] https://github.com/wjones127/arrow-parquet-memory-bench
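P.S. For measuring RSS on Linux, below is a minimal sketch of the kind of
check I have in mind for the benchmark. It assumes POSIX getrusage(); on
Linux, ru_maxrss reports the peak resident set size in kilobytes.

#include <sys/resource.h>

#include <cstdio>

#include <arrow/memory_pool.h>

// Report peak memory as seen by the default Arrow memory pool and by the
// OS. RSS also covers allocations that bypass the pool (plus everything
// else in the process), per Weston's note above.
void ReportPeakMemory() {
  long long pool_peak = arrow::default_memory_pool()->max_memory();
  std::printf("pool max_memory: %lld bytes\n", pool_peak);

  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) == 0) {
    std::printf("peak RSS:        %ld KiB\n", usage.ru_maxrss);
  }
}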