I think there are a few different things being discussed here:

 * Reusing large blocks of memory

I don't think the memory pools actually provide this kind of reuse (e.g.
they aren't like "connection pools" or "thread pools").  I'm pretty sure
that when you allocate a new buffer on a pool, it always triggers an
allocation on the underlying allocator.  That being said, I think this is
generally fine.  Allocators themselves (e.g. malloc, jemalloc) will keep
and reuse blocks of memory before returning them to the OS, though this
can be complicated by things like fragmentation.
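
For illustration, a minimal sketch of what I mean, using Arrow's C++
buffer API (arrow::AllocateBuffer and the default pool); the comments
reflect my understanding above, not a documented guarantee:

    #include <arrow/buffer.h>
    #include <arrow/memory_pool.h>
    #include <iostream>

    int main() {
      arrow::MemoryPool* pool = arrow::default_memory_pool();
      // Allocating through the pool goes straight to the underlying
      // allocator (jemalloc/mimalloc/system); the pool tracks statistics
      // but (as far as I know) does not cache freed blocks for reuse.
      auto buf = arrow::AllocateBuffer(1 << 20, pool).ValueOrDie();
      std::cout << pool->bytes_allocated() << std::endl;  // ~1 MiB
      buf.reset();  // handed back to the allocator, not kept by the pool
      std::cout << pool->bytes_allocated() << std::endl;  // back to 0
      return 0;
    }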

One potential exception to the "let allocators handle the reuse" rule would
be cases where you are frequently allocating buffers that are the exact
same size (or you are ok with the buffers being larger than you need so you
can reuse them).  For example, packet pools are very common in network
programming.  In this case, you can perhaps be more efficient than the
allocator, since you know the buffers have the same size.
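
As a sketch of what I mean (hypothetical code, not anything Arrow
provides today), a fixed-size pool can be as simple as a free list:

    #include <cstddef>
    #include <memory>
    #include <mutex>
    #include <vector>

    // Hypothetical pool where every buffer has the same size, so a freed
    // buffer can be handed back out without touching the allocator.
    class FixedSizeBufferPool {
     public:
      explicit FixedSizeBufferPool(std::size_t buffer_size)
          : buffer_size_(buffer_size) {}

      std::unique_ptr<std::byte[]> Acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!free_list_.empty()) {
          auto buf = std::move(free_list_.back());
          free_list_.pop_back();
          return buf;  // reuse: no call into malloc/jemalloc
        }
        return std::make_unique<std::byte[]>(buffer_size_);
      }

      void Release(std::unique_ptr<std::byte[]> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_list_.push_back(std::move(buf));  // keep for the next Acquire
      }

     private:
      std::size_t buffer_size_;
      std::mutex mutex_;
      std::vector<std::unique_ptr<std::byte[]>> free_list_;
    };

Since Acquire/Release only push and pop a free list, the common case is a
mutex plus a pointer move, with no allocator call and no fragmentation
concerns.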

It's not entirely clear to me that this kind of pool would be useful when
reading Parquet.

 * shared_ptr overhead

Every time a shared_ptr is created there is an atomic increment of the ref
counter, and every time one is destroyed there is an atomic decrement.
These atomic operations introduce memory fences, which can foil compiler
optimizations and are costly on their own.
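
A contrived example of where that cost shows up (RecordBatchLike is just
a stand-in type): passing a shared_ptr by value copies it, which costs an
atomic increment/decrement pair per call, while passing by const
reference avoids the refcount traffic entirely:

    #include <memory>

    struct RecordBatchLike { int rows = 0; };

    // Copying the shared_ptr: atomic increment on entry, atomic
    // decrement on exit.
    int ByValue(std::shared_ptr<RecordBatchLike> batch) { return batch->rows; }

    // Borrowing it: no refcount traffic at all.
    int ByRef(const std::shared_ptr<RecordBatchLike>& batch) { return batch->rows; }

    int main() {
      auto batch = std::make_shared<RecordBatchLike>();
      for (int i = 0; i < 1000000; ++i) {
        ByValue(batch);  // ~2M atomic ops over the loop
        ByRef(batch);    // none
      }
      return 0;
    }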

> I'm using the RecordBatchReader::ReadNext interface to read Parquet
data in my project, and I've noticed that there are a lot of temporary
object destructors being generated during usage.

Can you clarify what you mean here?  When I read this sentence I thought of
something completely different from the two things mentioned above :)
At one point I suspected that Thrift was generating a lot of small
allocations while reading the Parquet metadata, and that this was leading
to fragmentation of the system allocator (Thrift's allocations do not go
through the memory pool / jemalloc, and in Datasets we have a bit of a
habit of keeping Parquet metadata around to speed up future reads).  I
never did investigate this further, though.

On Fri, May 12, 2023 at 10:48 AM David Li <lidav...@apache.org> wrote:

> I can't find it anymore, but there is a quite old issue that made the same
> observation: RecordBatch's heavy use of shared_ptr in C++ can lead to a lot
> of overhead just calling destructors. That may be something to explore more
> (e.g. I think someone had tried to "unbox" some of the fields in
> RecordBatch).
>
> On Fri, May 12, 2023, at 13:04, Will Jones wrote:
> > Hello,
> >
> > I'm not sure if there are easy ways to avoid calling the destructors.
> > However, I would point out memory space reuse is handled through memory
> > pools; if you have one enabled it shouldn't be handing memory back to the
> > OS between each iteration.
> >
> > Best,
> >
> > Will Jones
> >
> > On Fri, May 12, 2023 at 9:59 AM SHI BEI <shibei...@foxmail.com> wrote:
> >
> >> Hi community,
> >>
> >>
> >> I'm using the RecordBatchReader::ReadNext interface to read Parquet
> >> data in my project, and I've noticed that there are a lot of temporary
> >> object destructors being generated during usage. Has the community
> >> considered providing an interface to reuse RecordBatch objects
> >> and their memory space for storing data?
> >>
> >>
> >>
> >>
> >> SHI BEI
> >> shibei...@foxmail.com
>
