As suggested in other replies, I also highly recommend running this under a heap profiling tool to investigate what's actually going on.
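For example, with gperftools (Xuwei's [2] below) you can bracket the scan with the heap profiler API. A minimal sketch, assuming you link against tcmalloc; RunScan() here is just a placeholder for the scan loop from your gist:

    // Link with -ltcmalloc; API from gperftools' <gperftools/heap-profiler.h>.
    #include <gperftools/heap-profiler.h>

    // Placeholder for the actual dataset scan being measured.
    void RunScan() { /* ... scan the parquet dataset ... */ }

    int main() {
      HeapProfilerStart("/tmp/scan_heap");  // writes /tmp/scan_heap.0001.heap, ...
      RunScan();
      HeapProfilerDump("after-scan");       // force a final dump at this point
      HeapProfilerStop();
      return 0;
    }

Then something like "pprof --text ./your_binary /tmp/scan_heap.0001.heap" should show which call sites retain the memory. jemalloc's heap profiling (Xuwei's [1]) is an alternative that needs no code changes, though note that Arrow's bundled jemalloc is built with a symbol prefix, so the stock MALLOC_CONF recipe may not apply directly.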
BTW, 800 columns looks suspicious to me. Could you try opening the files without reading any batches, i.e. read only the footer metadata? I'm not sure whether the file metadata is the root cause. Or you may want to try another dataset with a smaller number of columns.
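Something along these lines should isolate the metadata cost. A minimal sketch, assuming local files; the path and iteration count are placeholders:

    #include <cstdio>
    #include <memory>

    #include <arrow/io/file.h>
    #include <parquet/file_reader.h>  // parquet::ReadMetaData

    // Open each file and parse only the footer metadata - no batches.
    // If rss still grows linearly in the number of files, metadata
    // handling is a suspect.
    int main() {
      const int n = 1000;  // placeholder: number of (identical) files
      for (int i = 0; i < n; ++i) {
        auto maybe_file = arrow::io::ReadableFile::Open("a.parquet");
        if (!maybe_file.ok()) return 1;
        std::shared_ptr<parquet::FileMetaData> md =
            parquet::ReadMetaData(*maybe_file);
        std::printf("iter %d: %d columns, %d row groups\n", i,
                    md->num_columns(), md->num_row_groups());
      }
      return 0;
    }

If rss stays flat here but grows during the full scan, the problem is more likely in the column reading path than in the 800-column metadata.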

On Thu, Sep 7, 2023 at 5:45 AM Li Jin <ice.xell...@gmail.com> wrote:

> Correction:
>
> > I tried with both Antoine's suggestions (swapping the default allocator
> > and calling ReleaseUnused) but neither seems to affect the max rss.
>
> Calling ReleaseUnused does have some effect on the rss - the max rss goes
> from ~6G -> 5G but still there seems to be something else.
>
> On Wed, Sep 6, 2023 at 4:35 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Also attaching my experiment code just in case:
> > https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
> >
> > On Wed, Sep 6, 2023 at 4:29 PM Li Jin <ice.xell...@gmail.com> wrote:
> >
> >> Reporting back with some new findings.
> >>
> >> Re Felipe and Antoine:
> >> I tried both of Antoine's suggestions (swapping the default allocator
> >> and calling ReleaseUnused) but neither seems to affect the max rss. In
> >> addition, I managed to repro the issue by reading a list of n local
> >> parquet files that all point to the same file, i.e., {"a.parquet",
> >> "a.parquet", ... }. I am also able to crash my process by passing a
> >> large enough n (I observed rss keep going up until eventually the
> >> process gets killed). This observation led me to think there might
> >> actually be some memory leak issues.
> >>
> >> Re Xuwei:
> >> Thanks for the tips. I am gonna try this memory profiling next and see
> >> what I can find.
> >>
> >> I am gonna keep looking into this but again, any ideas / suggestions
> >> are appreciated (and thanks for all the help so far!)
> >>
> >> Li
> >>
> >> On Wed, Sep 6, 2023 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
> >>
> >>> Thanks all for the additional suggestions. Will try them but want to
> >>> answer Antoine's question first:
> >>>
> >>> > Which leads to the question: what is your OS?
> >>>
> >>> I am testing this on Debian, kernel 5.4.228, x86_64 GNU/Linux.
> >>>
> >>> On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com>
> >>> wrote:
> >>>
> >>>> By the way, you can try to use a memory profiler like [1] or [2].
> >>>> It would help to find out how the memory is used.
> >>>>
> >>>> Best,
> >>>> Xuwei Fu
> >>>>
> >>>> [1]
> >>>> https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
> >>>> [2] https://google.github.io/tcmalloc/gperftools.html
> >>>>
> >>>> On Thu, Sep 7, 2023 at 00:28, Felipe Oliveira Carvalho
> >>>> <felipe...@gmail.com> wrote:
> >>>>
> >>>> > > (a) stays pretty stable throughout the scan (stays < 1G), (b)
> >>>> > > keeps increasing during the scan (looks linear to the number of
> >>>> > > files scanned).
> >>>> >
> >>>> > I wouldn't take this to mean a memory leak but the memory allocator
> >>>> > not paging out virtual memory that has been allocated throughout
> >>>> > the scan. Could you run your workload under a memory profiler?
> >>>> >
> >>>> > > (3) Scan the same dataset twice in the same process doesn't
> >>>> > > increase the max rss.
> >>>> >
> >>>> > Another sign this isn't a leak, just the allocator reaching a level
> >>>> > of memory commitment that it doesn't feel like undoing.
> >>>> >
> >>>> > --
> >>>> > Felipe
> >>>> >
> >>>> > On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com>
> >>>> > wrote:
> >>>> >
> >>>> > > Hello,
> >>>> > >
> >>>> > > I have been testing "What is the max rss needed to scan through
> >>>> > > ~100G of data in parquet stored in GCS using Arrow C++".
> >>>> > >
> >>>> > > The current answer is about ~6G of memory, which seems a bit
> >>>> > > high, so I looked into it. What I observed during the process led
> >>>> > > me to think that there are some potential cache/memory issues in
> >>>> > > the dataset/parquet cpp code.
> >>>> > >
> >>>> > > Main observations:
> >>>> > > (1) As I am scanning through the dataset, I printed out (a)
> >>>> > > memory allocated by the memory pool from ScanOptions and (b)
> >>>> > > process rss. I found that while (a) stays pretty stable
> >>>> > > throughout the scan (stays < 1G), (b) keeps increasing during the
> >>>> > > scan (looks linear to the number of files scanned).
> >>>> > > (2) I tested the ScanNode in Arrow as well as an in-house library
> >>>> > > that implements its own "S3Dataset" similar to Arrow dataset, and
> >>>> > > both show similar rss usage. (Which led me to think the issue is
> >>>> > > more likely to be in the parquet cpp code than in the dataset
> >>>> > > code.)
> >>>> > > (3) Scanning the same dataset twice in the same process doesn't
> >>>> > > increase the max rss.
> >>>> > >
> >>>> > > I plan to look into the parquet cpp/dataset code, but I wonder if
> >>>> > > someone has some clues about what the issue might be or where to
> >>>> > > look?
> >>>> > >
> >>>> > > Thanks,
> >>>> > > Li
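PS: for anyone who wants to poke at this, a self-contained version of the repro Li describes above (the same local file listed n times, printing pool bytes vs rss) might look like the sketch below. This assumes an Arrow C++ build with dataset and Parquet enabled; the file path and n are placeholders:

    #include <cstdio>
    #include <fstream>
    #include <memory>
    #include <string>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/dataset/api.h>
    #include <arrow/filesystem/api.h>

    namespace ds = arrow::dataset;

    // Current resident set size in kB, from /proc/self/status (Linux).
    long CurrentRssKb() {
      std::ifstream status("/proc/self/status");
      std::string key;
      long kb = 0;
      while (status >> key) {
        if (key == "VmRSS:") { status >> kb; break; }
        status.ignore(1024, '\n');
      }
      return kb;
    }

    arrow::Status RunRepro(int n) {
      // n logical files that all point at the same physical file.
      auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
      std::vector<std::string> paths(n, "/tmp/a.parquet");  // placeholder

      auto format = std::make_shared<ds::ParquetFileFormat>();
      ds::FileSystemFactoryOptions options;
      ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                              fs, paths, format, options));
      ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
      ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
      ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
      ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());

      // Drain the scan, periodically comparing pool usage against rss.
      std::shared_ptr<arrow::RecordBatch> batch;
      for (int i = 0;; ++i) {
        ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
        if (batch == nullptr) break;
        if (i % 100 == 0) {
          std::printf("batch %d: pool=%lld bytes, rss=%ld kB\n", i,
                      static_cast<long long>(
                          arrow::default_memory_pool()->bytes_allocated()),
                      CurrentRssKb());
        }
      }
      return arrow::Status::OK();
    }

    int main() { return RunRepro(1000).ok() ? 0 : 1; }

If the pool number stays flat while rss climbs with n, that matches Li's observation (1) and points away from Arrow's memory pool itself.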