Reporting back with some new findings. Re Felipe and Antoine: I tried both of Antoine's suggestions (swapping the default allocator and calling ReleaseUnused()), but neither seems to affect the max rss. In addition, I managed to reproduce the issue by reading a list of n local parquet paths that all point to the same file, i.e., {"a.parquet", "a.parquet", ...}. I am also able to crash my process by passing a large enough n: rss keeps going up until the process eventually gets killed. This observation leads me to think there might actually be a memory leak somewhere.
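In case it helps, below is roughly the standalone repro I am running. It is only a sketch, assuming a reasonably recent Arrow C++ with the datasets module, a local "a.parquet", and Linux; the kNumCopies constant and the CurrentRssKb() helper are my own names, and the scan just uses the default memory pool.

#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/localfs.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

// Read VmRSS (in kB) from /proc/self/status; Linux-only helper.
long CurrentRssKb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) return std::stol(line.substr(6));
  }
  return -1;
}

arrow::Status RunRepro() {
  // n copies of the same local file; bump this up until the process is killed.
  const int kNumCopies = 500;
  std::vector<std::string> paths(kNumCopies, "a.parquet");

  auto filesystem = std::make_shared<fs::LocalFileSystem>();
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(filesystem, paths, format,
                                         ds::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
  ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());

  std::shared_ptr<arrow::RecordBatch> batch;
  int64_t batches_seen = 0;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;
    if (++batches_seen % 100 == 0) {
      // Per Antoine's suggestion; in my runs this did not lower max rss.
      arrow::default_memory_pool()->ReleaseUnused();
      // (a) pool bytes stay roughly flat, while (b) rss keeps climbing.
      std::cout << "pool_bytes="
                << arrow::default_memory_pool()->bytes_allocated()
                << " rss_kb=" << CurrentRssKb() << std::endl;
    }
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunRepro();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}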
Re Xuwei: Thanks for the tips. I am going to try the memory profilers next and see what I can find (I have pasted a rough sketch of how I plan to hook gperftools in at the bottom of this mail).

I am going to keep looking into this, but again, any ideas / suggestions are appreciated (and thanks for all the help so far!)

Li

On Wed, Sep 6, 2023 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:

> Thanks all for the additional suggestions. I will try them, but I want to
> answer Antoine's question first:
>
> > Which leads to the question: what is your OS?
>
> I am testing this on Debian 5.4.228 x86_64 GNU/Linux
>
> On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com> wrote:
>
>> By the way, you can try to use a memory profiler like [1] and [2].
>> It would help to find out how the memory is used.
>>
>> Best,
>> Xuwei Fu
>>
>> [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
>> [2] https://google.github.io/tcmalloc/gperftools.html
>>
>>
>> On Thu, Sep 7, 2023 at 00:28 Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
>>
>> > > (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps
>> > > increasing during the scan (looks linear to the number of files
>> > > scanned).
>> >
>> > I wouldn't take this to mean a memory leak, but rather the memory
>> > allocator not paging out virtual memory that has been allocated
>> > throughout the scan. Could you run your workload under a memory
>> > profiler?
>> >
>> > > (3) Scanning the same dataset twice in the same process doesn't
>> > > increase the max rss.
>> >
>> > Another sign this isn't a leak, just the allocator reaching a level of
>> > memory commitment that it doesn't feel like undoing.
>> >
>> > --
>> > Felipe
>> >
>> > On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com> wrote:
>> >
>> > > Hello,
>> > >
>> > > I have been testing "What is the max rss needed to scan through ~100G
>> > > of data in a parquet dataset stored in GCS using Arrow C++".
>> > >
>> > > The current answer is about ~6G of memory, which seems a bit high, so
>> > > I looked into it. What I observed during the process led me to think
>> > > that there are some potential cache/memory issues in the
>> > > dataset/parquet cpp code.
>> > >
>> > > Main observations:
>> > > (1) As I am scanning through the dataset, I printed out (a) memory
>> > > allocated by the memory pool from ScanOptions and (b) process rss. I
>> > > found that while (a) stays pretty stable throughout the scan (stays
>> > > < 1G), (b) keeps increasing during the scan (looks linear to the
>> > > number of files scanned).
>> > > (2) I tested ScanNode in Arrow as well as an in-house library that
>> > > implements its own "S3Dataset" similar to Arrow dataset, both showing
>> > > similar rss usage. (Which led me to think the issue is more likely to
>> > > be in the parquet cpp code instead of the dataset code.)
>> > > (3) Scanning the same dataset twice in the same process doesn't
>> > > increase the max rss.
>> > >
>> > > I plan to look into the parquet cpp/dataset code, but I wonder if
>> > > someone has some clues about what the issue might be or where to look?
>> > >
>> > > Thanks,
>> > > Li
>> > >
>> >
>>
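P.S. The gperftools hookup I mentioned above. This is only a sketch, assuming the binary is linked against tcmalloc and that RunRepro() is the repro I pasted earlier in this message; the /tmp/scan_repro prefix is arbitrary. Since Arrow's default pool is jemalloc- or mimalloc-backed, I would also run this with ARROW_DEFAULT_MEMORY_POOL=system so the tcmalloc heap profiler actually sees the pool allocations.

// Wrap the parquet scan with gperftools' heap profiler (link with -ltcmalloc).
// Profiles are written as /tmp/scan_repro.<n>.heap and can be inspected with pprof.
#include <gperftools/heap-profiler.h>

#include <arrow/status.h>
#include <iostream>

arrow::Status RunRepro();  // the repro sketched earlier in this mail

int main() {
  HeapProfilerStart("/tmp/scan_repro");  // start writing heap profiles
  arrow::Status st = RunRepro();         // run the scan under the profiler
  HeapProfilerDump("scan finished");     // force a final dump
  HeapProfilerStop();
  if (!st.ok()) std::cerr << st.ToString() << std::endl;
  return st.ok() ? 0 : 1;
}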