Update:

I have done some memory profiling and the results seem to suggest a memory
leak. I have opened an issue to discuss this further:
https://github.com/apache/arrow/issues/37630


On Fri, Sep 8, 2023 at 10:04 AM Li Jin <ice.xell...@gmail.com> wrote:

> Update:
>
> I have done some memory profiling and the results seem to suggest a memory
> leak. I have opened an issue to discuss this further:
> https://github.com/apache/arrow/issues/37630
>
> Attaching the memory profiling result here as well:
>
> On Wed, Sep 6, 2023 at 9:18 PM Gang Wu <ust...@gmail.com> wrote:
>
>> As suggested in other comments, I also highly recommend using a heap
>> profiling tool to investigate what's going on there.
>>
>> BTW, 800 columns looks suspicious to me. Could you try the same test
>> without reading any batches? I am not sure whether the file metadata is
>> the root cause. You may also want to try another dataset with a smaller
>> number of columns.
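>>
>> For example, a rough sketch (a hypothetical helper, not code from this
>> thread) of opening a file and touching only the parquet footer metadata,
>> without reading any batches:
>>
>>   #include <parquet/api/reader.h>
>>
>>   #include <iostream>
>>   #include <memory>
>>   #include <string>
>>
>>   void OpenMetadataOnly(const std::string& path) {
>>     // Reads only the footer/metadata; no row groups are decoded.
>>     std::unique_ptr<parquet::ParquetFileReader> reader =
>>         parquet::ParquetFileReader::OpenFile(path);
>>     std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
>>     std::cout << path << ": " << metadata->num_columns() << " columns, "
>>               << metadata->num_row_groups() << " row groups" << std::endl;
>>   }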
>>
>> On Thu, Sep 7, 2023 at 5:45 AM Li Jin <ice.xell...@gmail.com> wrote:
>>
>> > Correction:
>> >
>> > > I tried both of Antoine's suggestions (swapping the default allocator
>> > > and calling ReleaseUnused), but neither seems to affect the max rss.
>> >
>> > Calling ReleaseUnused does have some effect on the rss: the max rss goes
>> > from ~6G to ~5G, but there still seems to be something else going on.
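>> >
>> > For reference, a minimal sketch (not the exact code in the gist linked
>> > below) of what swapping the allocator and calling ReleaseUnused looks
>> > like with Arrow C++'s memory pool API:
>> >
>> >   #include <arrow/memory_pool.h>
>> >
>> >   #include <iostream>
>> >
>> >   int main() {
>> >     // The backend can also be switched without code changes by setting
>> >     // the ARROW_DEFAULT_MEMORY_POOL environment variable to "system",
>> >     // "jemalloc" or "mimalloc" before starting the process.
>> >     arrow::MemoryPool* pool = arrow::default_memory_pool();
>> >     std::cout << "allocator backend: " << pool->backend_name() << std::endl;
>> >
>> >     // ... run the scan here ...
>> >
>> >     // Hint that cached but unused memory can be returned to the OS.
>> >     pool->ReleaseUnused();
>> >     std::cout << "bytes still allocated: " << pool->bytes_allocated()
>> >               << std::endl;
>> >     return 0;
>> >   }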
>> >
>> > On Wed, Sep 6, 2023 at 4:35 PM Li Jin <ice.xell...@gmail.com> wrote:
>> >
>> > > Also attaching my experiment code just in case:
>> > > https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
>> > >
>> > > On Wed, Sep 6, 2023 at 4:29 PM Li Jin <ice.xell...@gmail.com> wrote:
>> > >
>> > >> Reporting back with some new findings.
>> > >>
>> > >> Re Felipe and Antoine:
>> > >> I tried both of Antoine's suggestions (swapping the default allocator
>> > >> and calling ReleaseUnused), but neither seems to affect the max rss. In
>> > >> addition, I managed to repro the issue by reading a list of n local
>> > >> parquet files that all point to the same file, i.e., {"a.parquet",
>> > >> "a.parquet", ... }. I am also able to crash my process by passing a
>> > >> large enough n (I observed the rss keep going up until the process
>> > >> eventually got killed). This observation led me to think there might
>> > >> actually be some memory leak issues.
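>> > >>
>> > >> A rough sketch of the shape of that repro (simplified and hypothetical;
>> > >> the actual experiment code is in the gist linked elsewhere in this
>> > >> thread), using the local filesystem and the dataset API:
>> > >>
>> > >>   #include <arrow/dataset/api.h>
>> > >>   #include <arrow/filesystem/localfs.h>
>> > >>   #include <arrow/record_batch.h>
>> > >>
>> > >>   #include <memory>
>> > >>   #include <string>
>> > >>   #include <vector>
>> > >>
>> > >>   void ScanSameFileNTimes(const std::string& path, int n) {
>> > >>     auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
>> > >>     auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
>> > >>     // n logical files that all point at the same physical file.
>> > >>     std::vector<std::string> paths(n, path);
>> > >>     auto factory = arrow::dataset::FileSystemDatasetFactory::Make(
>> > >>                        fs, paths, format,
>> > >>                        arrow::dataset::FileSystemFactoryOptions{})
>> > >>                        .ValueOrDie();
>> > >>     auto dataset = factory->Finish().ValueOrDie();
>> > >>     auto scanner = dataset->NewScan().ValueOrDie()->Finish().ValueOrDie();
>> > >>     auto reader = scanner->ToRecordBatchReader().ValueOrDie();
>> > >>     std::shared_ptr<arrow::RecordBatch> batch;
>> > >>     // Drain all batches without holding on to any of them.
>> > >>     while (reader->ReadNext(&batch).ok() && batch != nullptr) {
>> > >>       batch.reset();
>> > >>     }
>> > >>   }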
>> > >>
>> > >> Re Xuwei:
>> > >> Thanks for the tips. I am gonna try memory profiling this next and see
>> > >> what I can find.
>> > >>
>> > >> I am gonna keep looking into this, but again, any ideas / suggestions
>> > >> are appreciated (and thanks for all the help so far!)
>> > >>
>> > >> Li
>> > >>
>> > >> On Wed, Sep 6, 2023 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>> > >>
>> > >>> Thanks all for the additional suggestions. Will try them, but I want
>> > >>> to answer Antoine's question first:
>> > >>>
>> > >>> > Which leads to the question: what is your OS?
>> > >>>
>> > >>> I am testing this on Debian 5.4.228 x86_64 GNU/Linux
>> > >>>
>> > >>> On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com>
>> > >>> wrote:
>> > >>>
>> > >>>> By the way, you can try using a memory profiler like [1] or [2].
>> > >>>> It would help to find out how the memory is being used.
>> > >>>>
>> > >>>> Best,
>> > >>>> Xuwei Fu
>> > >>>>
>> > >>>> [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
>> > >>>> [2] https://google.github.io/tcmalloc/gperftools.html
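>> > >>>>
>> > >>>> For instance, a minimal sketch of wrapping the scan with the
>> > >>>> gperftools heap profiler from [2] (this assumes the binary is linked
>> > >>>> against tcmalloc; the function names are from
>> > >>>> <gperftools/heap-profiler.h>):
>> > >>>>
>> > >>>>   #include <gperftools/heap-profiler.h>
>> > >>>>
>> > >>>>   void ProfiledScan() {
>> > >>>>     // Dumps are written as /tmp/scan_profile.<n>.heap.
>> > >>>>     HeapProfilerStart("/tmp/scan_profile");
>> > >>>>     // ... run the dataset scan here ...
>> > >>>>     HeapProfilerDump("after-scan");  // force an intermediate dump
>> > >>>>     HeapProfilerStop();
>> > >>>>   }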
>> > >>>>
>> > >>>>
>> > >>>> On Thu, Sep 7, 2023 at 00:28, Felipe Oliveira Carvalho
>> > >>>> <felipe...@gmail.com> wrote:
>> > >>>>
>> > >>>> > > (a) stays pretty stable throughout the scan (stays < 1G), (b)
>> > >>>> > > keeps increasing during the scan (looks linear in the number of
>> > >>>> > > files scanned).
>> > >>>> >
>> > >>>> > I wouldn't take this to mean a memory leak, but rather the memory
>> > >>>> > allocator not paging out virtual memory that has been allocated
>> > >>>> > throughout the scan. Could you run your workload under a memory
>> > >>>> > profiler?
>> > >>>> >
>> > >>>> > > (3) Scanning the same dataset twice in the same process doesn't
>> > >>>> > > increase the max rss.
>> > >>>> >
>> > >>>> > Another sign this isn't a leak: just the allocator reaching a level
>> > >>>> > of memory commitment that it doesn't feel like undoing.
>> > >>>> >
>> > >>>> > --
>> > >>>> > Felipe
>> > >>>> >
>> > >>>> > On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com>
>> > >>>> > wrote:
>> > >>>> >
>> > >>>> > > Hello,
>> > >>>> > >
>> > >>>> > > I have been testing "What is the max rss needed to scan through
>> > >>>> > > ~100G of data in a parquet dataset stored in GCS using Arrow
>> > >>>> > > C++?".
>> > >>>> > >
>> > >>>> > > The current answer is about ~6G of memory, which seems a bit
>> > >>>> > > high, so I looked into it. What I observed during the process led
>> > >>>> > > me to think that there are some potential cache/memory issues in
>> > >>>> > > the dataset/parquet cpp code.
>> > >>>> > >
>> > >>>> > > Main observations:
>> > >>>> > > (1) As I scan through the dataset, I printed out (a) the memory
>> > >>>> > > allocated by the memory pool from ScanOptions and (b) the process
>> > >>>> > > rss (one way to measure these is sketched below). I found that
>> > >>>> > > while (a) stays pretty stable throughout the scan (stays < 1G),
>> > >>>> > > (b) keeps increasing during the scan (looks linear in the number
>> > >>>> > > of files scanned).
>> > >>>> > > (2) I tested the ScanNode in Arrow as well as an in-house library
>> > >>>> > > that implements its own "S3Dataset" similar to Arrow dataset;
>> > >>>> > > both show similar rss usage. (This led me to think the issue is
>> > >>>> > > more likely to be in the parquet cpp code than in the dataset
>> > >>>> > > code.)
>> > >>>> > > (3) Scanning the same dataset twice in the same process doesn't
>> > >>>> > > increase the max rss.
>> > >>>> > >
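>> > >>>> > > For reference, a minimal sketch (hypothetical, not the actual
>> > >>>> > > test code) of one way to measure (a) and (b) on Linux:
>> > >>>> > >
>> > >>>> > >   #include <arrow/dataset/scanner.h>  // arrow::dataset::ScanOptions
>> > >>>> > >   #include <arrow/memory_pool.h>
>> > >>>> > >
>> > >>>> > >   #include <fstream>
>> > >>>> > >   #include <iostream>
>> > >>>> > >   #include <string>
>> > >>>> > >
>> > >>>> > >   // (b) Resident set size of this process, from /proc/self/status.
>> > >>>> > >   std::string ReadVmRss() {
>> > >>>> > >     std::ifstream status("/proc/self/status");
>> > >>>> > >     std::string line;
>> > >>>> > >     while (std::getline(status, line)) {
>> > >>>> > >       if (line.rfind("VmRSS:", 0) == 0) return line;
>> > >>>> > >     }
>> > >>>> > >     return "VmRSS: unknown";
>> > >>>> > >   }
>> > >>>> > >
>> > >>>> > >   void PrintMemoryStats(const arrow::dataset::ScanOptions& options) {
>> > >>>> > >     // (a) Bytes currently allocated by the pool the scan uses.
>> > >>>> > >     std::cout << "pool bytes_allocated: "
>> > >>>> > >               << options.pool->bytes_allocated() << std::endl;
>> > >>>> > >     std::cout << ReadVmRss() << std::endl;
>> > >>>> > >   }
>> > >>>> > >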
>> > >>>> > > I plan to look into the parquet cpp/dataset code, but I wonder
>> > >>>> > > if someone has any clues about what the issue might be or where
>> > >>>> > > to look.
>> > >>>> > >
>> > >>>> > > Thanks,
>> > >>>> > > Li
>> > >>>> > >
>> > >>>> >
>> > >>>>
>> > >>>
>> >
>>
>
