Hello, I have been investigating the question "What is the max RSS needed to scan through ~100 GB of Parquet data stored in GCS using Arrow C++?"
The current answer is about ~6 GB, which seems high, so I looked into it. What I observed leads me to think there may be a cache/memory issue in the dataset/Parquet C++ code.

Main observations:

(1) While scanning the dataset, I printed (a) the memory allocated by the memory pool from ScanOptions and (b) the process RSS. I found that while (a) stays fairly stable throughout the scan (< 1 GB), (b) keeps increasing, roughly linearly in the number of files scanned.

(2) I tested both the ScanNode in Arrow and an in-house library that implements its own "S3Dataset" similar to the Arrow dataset; both show similar RSS usage. This leads me to think the issue is more likely in the Parquet C++ code than in the dataset code.

(3) Scanning the same dataset twice in the same process does not increase the max RSS.

I plan to look into the Parquet C++/dataset code, but I wonder if someone has clues about what the issue might be or where to look?

Thanks,
Li