The mailing list cannot handle attachments or images. Can you upload the flame graphs to a gist?
On Fri, May 12, 2023 at 6:55 PM SHI BEI <shibei...@foxmail.com> wrote:

> What I meant is that shared_ptr has a large overhead, which shows up
> clearly in the CPU flame graph. In my testing scenario there are 10
> Parquet files, each 1.3 GB in size, with no compression applied to the
> data within the files. Each row group in those files has 65536 rows.
> In each test, all files are read 10 times to make it easier to capture
> the CPU flame graph. To verify the issue described above, I controlled
> the number of calls to the RecordBatchReader::ReadNext interface by
> adjusting the number of rows read on each call. The CPU flame graph
> captures are as follows:
>
> 1) batch_size = 2048
>
> [flame graph image; attachment not delivered to the list]
>
> 2) batch_size = 65536
>
> [flame graph image; attachment not delivered to the list]
>
> ------------------------------
> SHI BEI
> shibei...@foxmail.com
>
>
> Original message
> From: "Weston Pace" <weston.p...@gmail.com>
> Date: 2023/5/13 2:30
> To: "dev" <dev@arrow.apache.org>
> Subject: Re: Reusing RecordBatch objects and their memory space
>
> I think there are perhaps various things being discussed here:
>
> * Reusing large blocks of memory
>
> I don't think the memory pools actually provide this kind of reuse
> (e.g. they aren't like "connection pools" or "thread pools"). I'm
> pretty sure that when you allocate a new buffer on a pool, it always
> triggers an allocation on the underlying allocator. That being said,
> I think this is generally fine. Allocators themselves (e.g. malloc,
> jemalloc) will keep and reuse blocks of memory before returning them
> to the OS, though this reuse can be defeated by things like
> fragmentation.
>
> One potential exception to the "let allocators handle the reuse" rule
> would be cases where you are frequently allocating buffers of exactly
> the same size (or you are OK with the buffers being larger than you
> need so that you can reuse them). For example, packet pools are very
> common in network programming. In this case you can perhaps be more
> efficient than the allocator, since you know the buffers all have the
> same size.
>
> It's not entirely clear to me that this would be useful for reading
> Parquet.
>
> * shared_ptr overhead
>
> Every time a shared_ptr is copied there is an atomic increment of the
> reference count, and every time one is destroyed there is an atomic
> decrement. These atomic operations introduce memory fences, which can
> foil compiler optimizations and are costly on their own.
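>
> As a minimal sketch of where that cost comes from (this is generic
> C++, not Arrow's internals): copying a shared_ptr bumps the atomic
> reference count, while borrowing it by const reference does not.
>
>   #include <cstdint>
>   #include <memory>
>   #include <vector>
>
>   struct Batch { std::vector<int64_t> values; };
>
>   // Copies the shared_ptr: atomic increment on entry, and an atomic
>   // decrement (plus a possible destructor run) when `b` goes out of
>   // scope.
>   int64_t SumByValue(std::shared_ptr<Batch> b) {
>     int64_t sum = 0;
>     for (int64_t v : b->values) sum += v;
>     return sum;
>   }
>
>   // Borrows the control block: no ref-count traffic at all.
>   int64_t SumByRef(const std::shared_ptr<Batch>& b) {
>     int64_t sum = 0;
>     for (int64_t v : b->values) sum += v;
>     return sum;
>   }
>
> Called in a tight loop, the by-value version pays two atomic
> operations per call. With many columns per batch (each array holding
> its own shared_ptrs to buffers) this multiplies quickly.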
> > I'm using the RecordBatchReader::ReadNext interface to read Parquet
> > data in my project, and I've noticed that there are a lot of
> > temporary object destructors being generated during usage.
>
> Can you clarify what you mean here? When I read this sentence I
> thought of something completely different than the previous two things
> mentioned :)  At one time I had a suspicion that thrift was generating
> a lot of small allocations while reading the parquet metadata, and
> that this was leading to fragmentation of the system allocator
> (thrift's allocations do not go through the memory pool / jemalloc,
> and we have a bit of a habit in datasets of keeping parquet metadata
> around to speed up future reads). I never did investigate this
> further, though.
>
> On Fri, May 12, 2023 at 10:48 AM David Li wrote:
>
> > I can't find it anymore, but there is a quite old issue that made
> > the same observation: RecordBatch's heavy use of shared_ptr in C++
> > can lead to a lot of overhead just calling destructors. That may be
> > something to explore more (e.g. I think someone had tried to
> > "unbox" some of the fields in RecordBatch).
> >
> > On Fri, May 12, 2023, at 13:04, Will Jones wrote:
> > > Hello,
> > >
> > > I'm not sure if there are easy ways to avoid calling the
> > > destructors. However, I would point out that memory reuse is
> > > handled through memory pools; if you have one enabled, it
> > > shouldn't be handing memory back to the OS between each
> > > iteration.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > On Fri, May 12, 2023 at 9:59 AM SHI BEI wrote:
> > >
> > >> Hi community,
> > >>
> > >> I'm using the RecordBatchReader::ReadNext interface to read
> > >> Parquet data in my project, and I've noticed that there are a
> > >> lot of temporary object destructors being generated during
> > >> usage. Has the community considered providing an interface to
> > >> reuse RecordBatch objects and their memory space for storing
> > >> data?
> > >>
> > >> SHI BEI
> > >> shibei...@foxmail.com
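
To make the discussion concrete, the read loop in question looks
roughly like the following (a sketch only; the exact
GetRecordBatchReader overloads vary between Arrow releases, and the
rows-per-call knob corresponds, if I remember right, to
parquet::ArrowReaderProperties::set_batch_size). Every ReadNext call
materializes a brand-new RecordBatch whose arrays and buffers are all
held by shared_ptr, so a smaller batch_size means proportionally more
batch teardowns, which is consistent with the difference between the
batch_size=2048 and batch_size=65536 flame graphs:

  #include <memory>
  #include <numeric>
  #include <vector>

  #include <arrow/io/file.h>
  #include <arrow/record_batch.h>
  #include <arrow/result.h>
  #include <arrow/status.h>
  #include <parquet/arrow/reader.h>

  arrow::Status ReadAll(const std::string& path) {
    ARROW_ASSIGN_OR_RAISE(auto infile,
                          arrow::io::ReadableFile::Open(path));

    // Open the Parquet file against the default memory pool.
    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
        infile, arrow::default_memory_pool(), &reader));

    // Read every row group; older releases return the batch reader
    // through a shared_ptr out-parameter instead of a unique_ptr.
    std::vector<int> row_groups(reader->num_row_groups());
    std::iota(row_groups.begin(), row_groups.end(), 0);
    std::unique_ptr<arrow::RecordBatchReader> rb_reader;
    ARROW_RETURN_NOT_OK(
        reader->GetRecordBatchReader(row_groups, &rb_reader));

    std::shared_ptr<arrow::RecordBatch> batch;
    while (true) {
      ARROW_RETURN_NOT_OK(rb_reader->ReadNext(&batch));
      if (batch == nullptr) break;  // end of stream
      // ... process `batch` ...
      // Overwriting `batch` on the next ReadNext releases the whole
      // tree of shared_ptrs (batch -> arrays -> buffers); that
      // teardown is where the destructor time shows up.
    }
    return arrow::Status::OK();
  }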