Update: I have done memory profiling and the results seem to suggest a memory leak. I have opened an issue to discuss this further: https://github.com/apache/arrow/issues/37630
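For anyone who wants to poke at this locally, below is a rough sketch of the kind of repro discussed further down in the thread: read the same local parquet file n times and compare (a) the bytes tracked by the Arrow memory pool with (b) the process RSS. This is not the code from the linked gist; the file name "a.parquet", the use of the plain parquet::arrow::FileReader instead of the dataset scanner, and the Linux-only RSS helper are illustrative assumptions.

// Hypothetical repro sketch: read the same Parquet file n times and watch
// (a) bytes tracked by the Arrow memory pool vs. (b) the process RSS.
// Not the code from the linked gist; path and structure are assumptions.
#include <fstream>
#include <iostream>
#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

// Parse VmRSS (kB) out of /proc/self/status (Linux only).
long RssKb() {
  std::ifstream status("/proc/self/status");
  std::string token;
  long kb = 0;
  while (status >> token) {
    if (token == "VmRSS:") {
      status >> kb;
      break;
    }
  }
  return kb;
}

// Read one parquet file fully into a Table, then drop it.
arrow::Status ReadOnce(const std::string& path, arrow::MemoryPool* pool) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path, pool));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(input, pool, &reader));
  std::shared_ptr<arrow::Table> table;
  return reader->ReadTable(&table);  // table goes out of scope after each call
}

int main() {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  const int n = 200;  // make this large enough to see whether RSS keeps growing
  for (int i = 0; i < n; ++i) {
    auto st = ReadOnce("a.parquet", pool);  // same file every iteration
    if (!st.ok()) {
      std::cerr << st.ToString() << std::endl;
      return 1;
    }
    std::cout << "iteration " << i
              << " pool bytes_allocated=" << pool->bytes_allocated()
              << " rss_kb=" << RssKb() << std::endl;
  }
  return 0;
}

If (a) stays flat while (b) grows roughly linearly with n, that matches the behavior described in the thread below.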
On Fri, Sep 8, 2023 at 10:04 AM Li Jin <ice.xell...@gmail.com> wrote:

> Update:
>
> I have done memory profiling and the results seem to suggest a memory
> leak. I have opened an issue to discuss this further:
> https://github.com/apache/arrow/issues/37630
>
> Attaching the memory profiling result here as well:
>
> On Wed, Sep 6, 2023 at 9:18 PM Gang Wu <ust...@gmail.com> wrote:
>
>> As suggested in other comments, I also highly recommend using a
>> heap profiling tool to investigate what's going on there.
>>
>> BTW, 800 columns look suspicious to me. Could you try to test them
>> without reading any batch? Not sure if the file metadata is the root
>> cause. Or you may want to try another dataset with a smaller number
>> of columns.
>>
>> On Thu, Sep 7, 2023 at 5:45 AM Li Jin <ice.xell...@gmail.com> wrote:
>>
>>> Correction:
>>>
>>>> I tried both of Antoine's suggestions (swapping the default allocator
>>>> and calling ReleaseUnused) but neither seems to affect the max rss.
>>>
>>> Calling ReleaseUnused does have some effect on the rss - the max rss
>>> goes from ~6G -> 5G but there still seems to be something else.
>>>
>>> On Wed, Sep 6, 2023 at 4:35 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>
>>>> Also attaching my experiment code just in case:
>>>> https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
>>>>
>>>> On Wed, Sep 6, 2023 at 4:29 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>>
>>>>> Reporting back with some new findings.
>>>>>
>>>>> Re Felipe and Antoine:
>>>>> I tried both of Antoine's suggestions (swapping the default allocator
>>>>> and calling ReleaseUnused) but neither seems to affect the max rss.
>>>>> In addition, I managed to repro the issue by reading a list of n
>>>>> local parquet files that all point to the same file, i.e.,
>>>>> {"a.parquet", "a.parquet", ... }. I am also able to crash my process
>>>>> by passing a large enough n. (I observed rss keep going up and
>>>>> eventually the process gets killed.) This observation led me to
>>>>> think there might actually be some memory leak issues.
>>>>>
>>>>> Re Xuwei:
>>>>> Thanks for the tips. I am gonna try memory profiling next and see
>>>>> what I can find.
>>>>>
>>>>> I am gonna keep looking into this but again, any ideas / suggestions
>>>>> are appreciated (and thanks for all the help so far!)
>>>>>
>>>>> Li
>>>>>
>>>>> On Wed, Sep 6, 2023 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>>>
>>>>>> Thanks all for the additional suggestions. Will try them, but I
>>>>>> want to answer Antoine's question first:
>>>>>>
>>>>>>> Which leads to the question: what is your OS?
>>>>>>
>>>>>> I am testing this on Debian 5.4.228 x86_64 GNU/Linux
>>>>>>
>>>>>> On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> By the way, you can try to use a memory profiler like [1] and [2].
>>>>>>> It would help to find out how the memory is used.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xuwei Fu
>>>>>>>
>>>>>>> [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
>>>>>>> [2] https://google.github.io/tcmalloc/gperftools.html
>>>>>>>
>>>>>>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote on Thu, Sep 7,
>>>>>>> 2023 at 00:28:
>>>>>>>
>>>>>>>> > (a) stays pretty stable throughout the scan (stays < 1G), (b)
>>>>>>>> > keeps increasing during the scan (looks linear to the number of
>>>>>>>> > files scanned).
>>>>>>>>
>>>>>>>> I wouldn't take this to mean a memory leak but the memory
>>>>>>>> allocator not paging out virtual memory that has been allocated
>>>>>>>> throughout the scan. Could you run your workload under a memory
>>>>>>>> profiler?
>>>>>>>>
>>>>>>>> > (3) Scanning the same dataset twice in the same process doesn't
>>>>>>>> > increase the max rss.
>>>>>>>>
>>>>>>>> Another sign this isn't a leak, just the allocator reaching a
>>>>>>>> level of memory commitment that it doesn't feel like undoing.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Felipe
>>>>>>>>
>>>>>>>> On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have been testing "What is the max rss needed to scan through
>>>>>>>>> ~100G of data in a parquet dataset stored in GCS using Arrow C++".
>>>>>>>>>
>>>>>>>>> The current answer is about ~6G of memory, which seems a bit
>>>>>>>>> high, so I looked into it. What I observed during the process led
>>>>>>>>> me to think that there are some potential cache/memory issues in
>>>>>>>>> the dataset/parquet cpp code.
>>>>>>>>>
>>>>>>>>> Main observations:
>>>>>>>>> (1) As I am scanning through the dataset, I printed out (a) memory
>>>>>>>>> allocated by the memory pool from ScanOptions and (b) process rss.
>>>>>>>>> I found that while (a) stays pretty stable throughout the scan
>>>>>>>>> (stays < 1G), (b) keeps increasing during the scan (looks linear
>>>>>>>>> to the number of files scanned).
>>>>>>>>> (2) I tested ScanNode in Arrow as well as an in-house library that
>>>>>>>>> implements its own "S3Dataset" similar to Arrow dataset, both
>>>>>>>>> showing similar rss usage. (Which led me to think the issue is
>>>>>>>>> more likely to be in the parquet cpp code instead of the dataset
>>>>>>>>> code.)
>>>>>>>>> (3) Scanning the same dataset twice in the same process doesn't
>>>>>>>>> increase the max rss.
>>>>>>>>>
>>>>>>>>> I plan to look into the parquet cpp/dataset code, but I wonder if
>>>>>>>>> someone has some clues what the issue might be or where to look?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Li
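For completeness, the two knobs Antoine suggested upthread - swapping the allocator behind the default memory pool and calling ReleaseUnused() - can be exercised roughly as follows. This is a sketch rather than the exact code used in the experiments; the in-process setenv call assumes nothing has touched the default pool yet (normally you would set ARROW_DEFAULT_MEMORY_POOL before launching the process), and it only has an effect if the chosen allocator was compiled into your Arrow build.

// Sketch of the two knobs discussed upthread (not code from the thread itself):
// 1) pick a different backing allocator for the default memory pool, and
// 2) ask the pool to hand unused memory back to the OS after a scan.
#include <cstdlib>
#include <iostream>

#include <arrow/memory_pool.h>

int main() {
  // (1) The backend of arrow::default_memory_pool() can be chosen with the
  // ARROW_DEFAULT_MEMORY_POOL environment variable (e.g. "system",
  // "jemalloc", "mimalloc"). It must be set before the pool is first used;
  // setting it here only works if nothing has touched the pool yet.
  setenv("ARROW_DEFAULT_MEMORY_POOL", "system", /*overwrite=*/0);

  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::cout << "backend: " << pool->backend_name() << std::endl;

  // ... run the scan here ...

  // (2) After the scan, ask the allocator to return cached/unused memory
  // to the OS.
  pool->ReleaseUnused();
  std::cout << "bytes still allocated: " << pool->bytes_allocated()
            << ", max: " << pool->max_memory() << std::endl;
  return 0;
}

In the experiments described above, calling ReleaseUnused() after the scan brought the max rss from ~6G down to ~5G but did not account for the remaining growth.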