Hi Jin,

Do you have more information about the parquet file? What comes to
mind is this issue: https://github.com/apache/arrow/issues/35393
If you have observed something concrete, please feel free to open a new
issue and post what you have found there.
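
Also, it would help to know which allocator backend your scan's memory
pool is using, since jemalloc/mimalloc can cache freed memory and keep
the process rss well above the pool's own accounting (which would also
fit your observation that a second scan does not raise max rss). A
minimal way to check, assuming a reasonably recent Arrow C++:

  #include <arrow/api.h>
  #include <iostream>

  int main() {
    // Prints the backend of the default pool: "system", "jemalloc", or
    // "mimalloc", depending on the build and on the
    // ARROW_DEFAULT_MEMORY_POOL environment variable.
    std::cout << arrow::default_memory_pool()->backend_name() << std::endl;
    return 0;
  }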

Thanks,
Gang

On Wed, Sep 6, 2023 at 11:56 PM Li Jin <ice.xell...@gmail.com> wrote:

> Hello,
>
> I have been testing "What is the max rss needed to scan through ~100G of
> data in a parquet dataset stored in GCS using Arrow C++?"
>
> The current answer is about 6G of memory, which seems a bit high, so I
> looked into it. What I observed during the process led me to think that
> there are some potential cache/memory issues in the dataset/parquet cpp
> code.
>
> Main observation:
> (1) As I scanned through the dataset, I printed out (a) the memory
> allocated by the memory pool from ScanOptions and (b) the process rss
> (a sketch of this measurement follows the list). I found that while (a)
> stays pretty stable throughout the scan (< 1G), (b) keeps increasing
> during the scan (roughly linear in the number of files scanned).
> (2) I tested the ScanNode in Arrow as well as an in-house library that
> implements its own "S3Dataset" similar to Arrow dataset; both show
> similar rss usage, which leads me to think the issue is more likely in
> the parquet cpp code than in the dataset code.
> (3) Scanning the same dataset twice in the same process doesn't increase
> the max rss.
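>
> For reference, below is roughly the shape of my measurement for (a) and
> (b). The bucket URI is a placeholder, ReadRssKb/ScanAndLog are names I
> made up for this sketch, the rss reading is Linux-only, and it pulls
> batches through Scanner::ToRecordBatchReader rather than a ScanNode
> (both read through the same ScanOptions pool):
>
>   #include <arrow/api.h>
>   #include <arrow/dataset/api.h>
>   #include <arrow/filesystem/api.h>
>   #include <fstream>
>   #include <iostream>
>   #include <string>
>
>   // Read VmRSS (in kB) from /proc/self/status; Linux-only, -1 on failure.
>   int64_t ReadRssKb() {
>     std::ifstream status("/proc/self/status");
>     for (std::string line; std::getline(status, line);) {
>       if (line.rfind("VmRSS:", 0) == 0) return std::stoll(line.substr(6));
>     }
>     return -1;
>   }
>
>   // Scan all parquet files under `uri` and, once per record batch, log
>   // the bytes currently allocated by the scan's memory pool next to the
>   // process rss.
>   arrow::Status ScanAndLog(const std::string& uri) {
>     arrow::MemoryPool* pool = arrow::default_memory_pool();
>     std::string path;
>     ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::FileSystemFromUri(uri, &path));
>
>     arrow::fs::FileSelector selector;
>     selector.base_dir = path;
>     selector.recursive = true;
>     ARROW_ASSIGN_OR_RAISE(
>         auto factory,
>         arrow::dataset::FileSystemDatasetFactory::Make(
>             fs, selector,
>             std::make_shared<arrow::dataset::ParquetFileFormat>(),
>             arrow::dataset::FileSystemFactoryOptions{}));
>     ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
>
>     ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
>     ARROW_RETURN_NOT_OK(builder->Pool(pool));  // the pool in ScanOptions
>     ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
>     ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());
>
>     std::shared_ptr<arrow::RecordBatch> batch;
>     while (true) {
>       ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
>       if (batch == nullptr) break;  // end of stream
>       std::cout << "pool_bytes=" << pool->bytes_allocated()
>                 << " rss_kb=" << ReadRssKb() << std::endl;
>     }
>     return arrow::Status::OK();
>   }
>
> With a call like ScanAndLog("gs://my-bucket/my-dataset") (again, a
> placeholder path), pool_bytes stays flat while rss_kb grows over the
> course of the scan.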
>
> I plan to look into the parquet cpp/dataset code, but I wonder if anyone
> has clues about what the issue might be or where to look.
>
> Thanks,
> Li
>
