Thanks all for the additional suggestions. I will try them, but I want to
answer Antoine's question first:

> Which leads to the question: what is your OS?

I am testing this on Debian, Linux kernel 5.4.228, x86_64.
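
Since the RSS behavior will depend on which allocator backend Arrow picked
up, I'll also verify that before profiling. A minimal sketch (assumes a
recent Arrow C++ where MemoryPool exposes backend_name()):

    // Print which allocator backend the default Arrow memory pool uses
    // ("jemalloc", "mimalloc", or "system").
    #include <arrow/memory_pool.h>
    #include <iostream>

    int main() {
      arrow::MemoryPool* pool = arrow::default_memory_pool();
      std::cout << "allocator backend: " << pool->backend_name() << std::endl;
      return 0;
    }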

On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com> wrote:

> By the way, you can try using a memory profiler like [1] or [2].
> It would help to find out how the memory is being used.
>
> Best,
> Xuwei Fu
>
> [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
> [2] https://google.github.io/tcmalloc/gperftools.html
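
For my own notes: a sketch of how I'd trigger a jemalloc heap dump mid-scan.
It assumes a jemalloc built with --enable-prof and run with
MALLOC_CONF=prof:true; Arrow's bundled jemalloc is symbol-prefixed, so this
needs an unprefixed system jemalloc, e.g. loaded via LD_PRELOAD.

    // Dump a jemalloc heap profile to a .heap file (inspect with jeprof).
    #include <jemalloc/jemalloc.h>

    void DumpHeapProfile() {
      // "prof.dump" writes a snapshot of live allocations to disk.
      mallctl("prof.dump", nullptr, nullptr, nullptr, 0);
    }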
>
>
> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote on Thu, Sep 7, 2023 at 00:28:
>
> > > (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps
> > > increasing during the scan (looks linear in the number of files scanned).
> >
> > I wouldn't take this to mean a memory leak, but rather the memory
> > allocator not returning to the OS the virtual memory it has allocated
> > throughout the scan. Could you run your workload under a memory profiler?
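
For reference, this is roughly how I sample (a) the pool-tracked bytes and
(b) the process RSS during the scan. A minimal sketch; the RSS read is
Linux-specific via /proc/self/statm:

    // Log (a) bytes tracked by the Arrow memory pool and (b) process RSS.
    #include <arrow/memory_pool.h>
    #include <fstream>
    #include <iostream>
    #include <unistd.h>

    void LogMemory(arrow::MemoryPool* pool) {
      long size_pages = 0, rss_pages = 0;
      std::ifstream statm("/proc/self/statm");
      statm >> size_pages >> rss_pages;  // second field is resident pages
      std::cout << "pool bytes_allocated: " << pool->bytes_allocated()
                << ", pool max_memory: " << pool->max_memory()
                << ", rss bytes: " << rss_pages * sysconf(_SC_PAGESIZE)
                << std::endl;
    }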
> >
> > > (3) Scanning the same dataset twice in the same process doesn't
> > > increase the max RSS.
> >
> > Another sign that this isn't a leak: the allocator is just reaching a
> > level of memory commitment that it doesn't feel like undoing.
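
One experiment I can try between the two scans: explicitly nudge the
allocator to give memory back and watch whether RSS drops. A sketch,
assuming the jemalloc-backed default pool in a recent Arrow C++;
jemalloc_set_decay_ms returns an error status when jemalloc isn't enabled:

    // Encourage the allocator to return freed memory to the OS.
    #include <arrow/memory_pool.h>
    #include <arrow/status.h>
    #include <iostream>

    void TryReleaseMemory() {
      // With the jemalloc backend, purge dirty pages immediately (decay = 0).
      arrow::Status st = arrow::jemalloc_set_decay_ms(0);
      if (!st.ok()) std::cerr << st.ToString() << std::endl;
      // Ask the default pool to hand unused memory back where supported.
      arrow::default_memory_pool()->ReleaseUnused();
    }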
> >
> > --
> > Felipe
> >
> > On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I have been testing "What is the max RSS needed to scan through ~100G of
> > > data in Parquet files stored in GCS using Arrow C++?".
> > >
> > > The current answer is about 6G of memory, which seems a bit high, so I
> > > looked into it. What I observed during the process led me to think that
> > > there are some potential cache/memory issues in the dataset/Parquet C++
> > > code.
> > >
> > > Main observations:
> > > (1) As I scan through the dataset, I print out (a) the memory allocated
> > > by the memory pool from ScanOptions and (b) the process RSS. I found
> > > that while (a) stays pretty stable throughout the scan (stays < 1G),
> > > (b) keeps increasing during the scan (looks linear in the number of
> > > files scanned).
> > > (2) I tested the ScanNode in Arrow as well as an in-house library that
> > > implements its own "S3Dataset" similar to the Arrow dataset; both show
> > > similar RSS usage. (This led me to think the issue is more likely in
> > > the Parquet C++ code than in the dataset code.)
> > > (3) Scanning the same dataset twice in the same process doesn't
> > > increase the max RSS.
> > >
> > > I plan to look into the Parquet C++/dataset code, but I wonder if
> > > someone has some clues about what the issue might be or where to look?
> > >
> > > Thanks,
> > > Li
> > >
> >
>
