Hello,

I have been testing the question "What is the max RSS needed to scan
through ~100 GB of Parquet data stored in GCS using Arrow C++?".
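For reference, the scan I am measuring looks roughly like the sketch below
(a minimal sketch only: the bucket/path are placeholders, error handling is
simplified, and it assumes Arrow is built with GCS and dataset support):

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <iostream>

arrow::Status ScanGcsParquet() {
  // Placeholder URI, not my actual dataset.
  std::string path;
  ARROW_ASSIGN_OR_RAISE(
      auto fs, arrow::fs::FileSystemFromUri("gs://my-bucket/my-dataset", &path));

  arrow::fs::FileSelector selector;
  selector.base_dir = path;
  selector.recursive = true;

  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      arrow::dataset::FileSystemDatasetFactory::Make(
          fs, selector, format, arrow::dataset::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  // Stream batches and drop them immediately; nothing is retained.
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatches());
  int64_t rows = 0;
  for (auto maybe_batch : batches) {
    ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
    rows += batch.record_batch->num_rows();
  }
  std::cout << "rows scanned: " << rows << std::endl;
  return arrow::Status::OK();
}

int main() {
  auto st = ScanGcsParquet();
  if (!st.ok()) std::cerr << st.ToString() << std::endl;
  return st.ok() ? 0 : 1;
}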

The current answer is about ~6 GB, which seems a bit high, so I looked into
it. What I observed during the process led me to think that there may be a
cache/memory issue in the dataset/Parquet C++ code.

Main observations:
(1) While scanning through the dataset, I printed out (a) the memory
allocated by the MemoryPool from ScanOptions and (b) the process RSS (see
the instrumentation sketch after this list). I found that (a) stays pretty
stable throughout the scan (< 1 GB), while (b) keeps increasing during the
scan, roughly linearly in the number of files scanned.
(2) I tested the ScanNode in Arrow as well as an in-house library that
implements its own "S3Dataset" similar to the Arrow dataset, and both show
similar RSS usage. (This led me to think the issue is more likely in the
Parquet C++ code than in the dataset code.)
(3) Scanning the same dataset twice in the same process doesn't increase
the max RSS.
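
The instrumentation in (1) is roughly the following (a sketch only:
ReadRssKb/ReportMemory are my own helper names, and the RSS read is
Linux-specific via /proc/self/status):

#include <arrow/memory_pool.h>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// Returns VmRSS in kB, or -1 if it cannot be read.
int64_t ReadRssKb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) {
      return std::stoll(line.substr(6));  // "VmRSS:   123456 kB"
    }
  }
  return -1;
}

// Print the pool's current/peak allocation next to the process RSS.
void ReportMemory(arrow::MemoryPool* pool, int files_scanned) {
  std::cout << "files=" << files_scanned
            << " pool_bytes=" << pool->bytes_allocated()
            << " pool_max=" << pool->max_memory()
            << " rss_kb=" << ReadRssKb() << std::endl;
}

I call this periodically with the pool from ScanOptions; (a) above is
pool_bytes/pool_max and (b) is rss_kb.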

I plan to look into the Parquet C++/dataset code, but I wonder if someone
has a clue what the issue might be or where to look?

Thanks,
Li
