Hi Li,

On 06/09/2023 17:55, Li Jin wrote:
> Hello,
>
> I have been testing "What is the max RSS needed to scan through ~100G of
> data in Parquet stored in GCS using Arrow C++?".
>
> The current answer is about ~6G of memory, which seems a bit high, so I
> looked into it. What I observed during the process led me to think that
> there are some potential cache/memory issues in the dataset/parquet C++
> code.
>
> Main observation:
> (1) As I am scanning through the dataset, I printed out (a) memory
> allocated by the memory pool from ScanOptions and (b) process RSS. I
> found that while (a) stays pretty stable throughout the scan (stays
> < 1G), (b) keeps increasing during the scan (looks linear in the number
> of files scanned).

RSS is typically not a very reliable indicator, because allocators tend to keep memory around as an allocation cache even after the application code has deallocated it.

(In other words: the application returns memory to the allocator, but the allocator does not always return memory to the OS, because requesting memory from the OS is expensive.)

You may start by trying a different memory pool (see https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL for an easy way to do that).
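
For instance, assuming you construct the ScanOptions yourself, a minimal sketch of routing the scan's allocations through a plain malloc-based pool instead of jemalloc/mimalloc (setting ARROW_DEFAULT_MEMORY_POOL=system in the environment before the process starts should achieve the same thing without recompiling):

    #include <arrow/dataset/scanner.h>
    #include <arrow/memory_pool.h>

    // Sketch: bypass the default (jemalloc/mimalloc) pool and its
    // allocation caching by having the scan allocate from the plain
    // system allocator instead.
    void UseSystemAllocator(arrow::dataset::ScanOptions* options) {
      options->pool = arrow::system_memory_pool();
    }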

My second suggestion is to ask the memory pool to release more memory: https://arrow.apache.org/docs/cpp/api/memory.html#_CPPv4N5arrow10MemoryPool13ReleaseUnusedEv
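
For example, calling it from time to time during the scan and watching whether RSS drops (a sketch, assuming your scan uses the default pool):

    #include <arrow/memory_pool.h>

    // Ask the allocator behind the default pool to return its unused
    // cached memory to the OS.
    arrow::default_memory_pool()->ReleaseUnused();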

If either of these two fixes the apparent RSS issue, then there is neither a leak nor a caching issue.

*However*, in addition to the Arrow memory pool, many small or medium-sized allocations (such as all Parquet Thrift metadata) will use the system allocator. These allocations will evade tracking by the memory pool.
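
If the extra RSS does come from the system allocator, and assuming you are on Linux with glibc, one glibc-specific experiment is to force it to give free memory back:

    #include <malloc.h>  // glibc-specific

    // Ask glibc's malloc to return free heap memory to the OS. If RSS
    // drops noticeably afterwards, the growth was allocator caching
    // outside the Arrow memory pool, not a leak.
    malloc_trim(0);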

Which leads to the question: what is your OS?

Regards

Antoine.
