> (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps increasing during the scan (looks linear in the number of files scanned).
I wouldn't take this to mean a memory leak, but rather the memory allocator
not returning memory that was allocated over the course of the scan back to
the OS. Could you run your workload under a memory profiler? (A rough sketch
of one such check is below the quoted message.)

> (3) Scanning the same dataset twice in the same process doesn't increase
> the max rss.

Another sign this isn't a leak, just the allocator reaching a level of memory
commitment that it doesn't feel like undoing.

--
Felipe

On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com> wrote:

> Hello,
>
> I have been testing "What is the max rss needed to scan through ~100G of
> data in a Parquet dataset stored in GCS using Arrow C++?"
>
> The current answer is about ~6G of memory, which seems a bit high, so I
> looked into it. What I observed during the process led me to think that
> there are some potential cache/memory issues in the dataset/parquet cpp
> code.
>
> Main observations:
> (1) As I am scanning through the dataset, I printed out (a) memory
> allocated by the memory pool from ScanOptions and (b) process rss. I found
> that while (a) stays pretty stable throughout the scan (stays < 1G), (b)
> keeps increasing during the scan (looks linear in the number of files
> scanned).
> (2) I tested the ScanNode in Arrow as well as an in-house library that
> implements its own "S3Dataset" similar to Arrow dataset, both showing
> similar rss usage (which led me to think the issue is more likely to be in
> the parquet cpp code rather than the dataset code).
> (3) Scanning the same dataset twice in the same process doesn't increase
> the max rss.
>
> I plan to look into the parquet cpp/dataset code, but I wonder if someone
> has some clues about what the issue might be or where to look?
>
> Thanks,
> Li
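
A minimal sketch of the kind of check suggested above, assuming Arrow C++
built with the jemalloc memory pool backend and a Linux /proc filesystem;
RunScan and Report are hypothetical placeholders for the dataset scan and
logging under discussion, not Arrow APIs:

// Compare what the Arrow memory pool reports against the process rss,
// then ask the allocator to return freed pages to the OS and see whether
// rss drops. If it does, the growth is allocator retention, not a leak.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <unistd.h>              // sysconf

#include <arrow/memory_pool.h>
#include <arrow/status.h>

// Current resident set size in bytes (Linux-specific: /proc/self/statm).
int64_t CurrentRssBytes() {
  std::ifstream statm("/proc/self/statm");
  int64_t total_pages = 0, resident_pages = 0;
  statm >> total_pages >> resident_pages;
  return resident_pages * sysconf(_SC_PAGESIZE);
}

void Report(const char* label, arrow::MemoryPool* pool) {
  std::cout << label << ": backend=" << pool->backend_name()
            << " pool bytes_allocated=" << pool->bytes_allocated()
            << " pool max_memory=" << pool->max_memory()
            << " process rss=" << CurrentRssBytes() << std::endl;
}

// Placeholder for the GCS/parquet dataset scan being investigated; calling
// Report() per batch inside the scan is what would show (a) staying flat
// while (b) grows.
void RunScan(arrow::MemoryPool* /*pool*/) { /* ... scan the dataset ... */ }

int main() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  // With the jemalloc backend, decay_ms = 0 asks jemalloc to return freed
  // pages to the OS eagerly instead of keeping them committed.
  arrow::Status st = arrow::jemalloc_set_decay_ms(0);
  if (!st.ok()) std::cerr << st.ToString() << std::endl;

  Report("before scan", pool);
  RunScan(pool);
  Report("after scan", pool);

  pool->ReleaseUnused();         // explicitly hand unused memory back
  Report("after ReleaseUnused", pool);
  return 0;
}

If rss falls after ReleaseUnused() or with decay set to 0 while the pool's
bytes_allocated stays flat, the extra resident memory was being held by the
allocator rather than leaked by the parquet/dataset code; if it keeps
climbing file after file even then, a run under a heap profiler is the next
step.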