On Mon, Apr 6, 2020 at 12:22 PM Todd Lipcon <t...@cloudera.com.invalid> wrote:
>
> On Mon, Apr 6, 2020 at 9:57 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hello Todd,
> >
> > Le 06/04/2020 à 18:18, Todd Lipcon a écrit :
> > >
> > > I had a couple questions / items that should be clarified in the spec.
> > Wes
> > > suggested I raise them here on dev@:
> > >
> > > *1) Should producers expect callers to zero-init structs?*
> >
> > IMO, they shouldn't.  They should fill the structure exhaustively.
> > Though ultimately it's a decision made by the implementer of the
> > producer API.
> >
> > > I suppose since it's the "C interface" it's
> > > probably best to follow the C-style "producer assumes the argument
> > contains
> > > uninitialized memory" convention.
> >
> > Exactly.
> >
> > > *2) Clarify lifetime semantics for nested structures*
> > >
> > > In my application, i'd like to allocate all of the children structures of
> > > an ArrowSchema or ArrowArray out of a memory pool which is stored in the
> > > private_data field of the top-level struct. As such, my initial
> > > implementation was to make the 'release' callback on the top-level struct
> > > delete that memory pool, and set the 'release' callback of all children
> > > structs to null, since their memory was totally owned by the top-level
> > > struct.
> > >
> > > I figured this approach was OK because the spec says:
> > >
> > >>  Consumers MUST call a base structure's release callback when they won't
> > > be using it anymore, but they MUST not call any of its children's release
> > > callbacks (including the optional dictionary). The producer is
> > responsible
> > > for releasing the children.
> > >
> > > That advice seems to indicate that I can do whatever I want with the
> > > release callback of the children, including not setting it.
> >
> > ... Except that in this case, moving a child wouldn't be possible.
> >
> > > This section of the spec also seems to be a bit in conflict with the
> > > following:
> > >
> > >> It is possible to move a child array, but the parent array MUST be
> > > released immediately afterwards, as it won't point to a valid child array
> > > anymore. This satisfies the use case of keeping only a subset of child
> > > arrays, while releasing the others.
> > >
> > > ... because if you have a parent array which owns the memory referred to
> > by
> > > the child, then moving the child (with a no-op release callback) followed
> > > by releasing the parent, you'll end up with an invalid or deallocated
> > child
> > > as well.
> >
> > I think the solution here is for your memory pool to be
> > reference-counted, and have each release callback in the array tree
> > decrement the reference count.  Does that sound reasonable to you?
> >
>
> Sure, I can do that. But I imagine consumers may also be a bit surprised by
> this behavior, that releasing the children doesn't actually free up any
> memory.

This should be documented in the producer API, I think, that even if
you move the children out of the parent that the common resource
persists beyond the parents' release() invocation. It may vary between
producer implementations of the C interface.

> The spec should also probably cover thread-safety: if the consumer gets an
> ArrowArray, is it safe to pass off the children to multiple threads and
> have them call release() concurrently? In other words, do I need to use a
> thread-safe reference count? I would guess so.

Yes, e.g. std::shared_ptr or similar. I agree that the spec should
address or at least provide a strong recommendation regarding
thread-safety of resources shared by distinct ArrowArray structures in
their private_data. While in many scenarios the top-level release
callback will handle destruction and any related thread-safety issues,
as soon as any children are moved out of the parent, their release
callbacks could be called at many time.

So in the spec it says

"It is possible to move a child array, but the parent array MUST be
released immediately afterwards, as it won't point to a valid child
array anymore. This satisfies the use case of keeping only a subset of
child arrays, while releasing the others."

we should add a sentence like "It is recommended that producers take
thread-safety into consideration and ensure that moved child arrays'
release callbacks can be called in a concurrent setting."

> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera

Reply via email to