C interface clarifications

Todd Lipcon Mon, 06 Apr 2020 09:19:25 -0700

Hey folks,

I've started working on a patch to make Apache Kudu's C++ client able to
expose batches of data in Arrow's new C-style interface
(https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst).


I have a couple of questions / items that should be clarified in the spec.
Wes suggested I raise them here on dev@:

*1) Should producers expect callers to zero-init structs?*

The spec suggests that producers have an interface like:

Status Produce(ArrowArray* array) {
  ...
}

In the case of Arrow's own producer implementation, it doesn't assume that
'array' has been initialized in any way prior to this call, and the first
thing it does is zero the memory of 'array'. This is pretty standard
behavior in C-style APIs (e.g. stat(2) doesn't assume that its out-argument
is initialized in any way).
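
To make the C-style convention concrete, here is a minimal sketch of a producer that never reads its out-argument. The struct layout is the one from the spec; produce_int32 and release_int32 are hypothetical names made up for this example, not Arrow or Kudu functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same member order as struct ArrowArray in the C data interface spec. */
struct ArrowArray {
  int64_t length, null_count, offset;
  int64_t n_buffers, n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical release callback for arrays made by produce_int32(). */
static void release_int32(struct ArrowArray* array) {
  free((void*)array->buffers[1]); /* values buffer */
  free((void*)array->buffers);
  array->release = NULL;          /* mark the struct as released */
}

/* C-style producer: *out is treated as uninitialized and never read. */
static int produce_int32(struct ArrowArray* out, int64_t length) {
  assert(length >= 0);
  memset(out, 0, sizeof(*out));   /* first step: zero the out-argument */
  const void** buffers = malloc(2 * sizeof(*buffers));
  /* +1 so a zero-length array still gets a non-NULL allocation. */
  int32_t* values = calloc((size_t)length + 1, sizeof(*values));
  if (buffers == NULL || values == NULL) {
    free(buffers);
    free(values);
    return -1;                    /* *out stays zeroed, i.e. released */
  }
  buffers[0] = NULL;              /* no validity bitmap, null_count is 0 */
  buffers[1] = values;
  out->length = length;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->release = release_int32;
  return 0;
}
```

With this convention the caller can pass a stack-allocated struct full of garbage, and the producer will not mistake stale bytes for a live release callback.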

An alternate approach would be to assume that 'array' is in some valid
state, and call array->release() if it is non-null prior to filling in the
array with new data. This is a more C++-style API: in C++ it's rare to have
uninitialized structures floating around because constructors usually put
objects into some kind of valid state before the object gets passed
anywhere.
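
For contrast, a rough sketch of that alternate convention, where the producer trusts the incoming struct. produce_replacing and release_counted are hypothetical names for illustration, and the call counter exists only so the behavior can be observed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same member order as struct ArrowArray in the C data interface spec. */
struct ArrowArray {
  int64_t length, null_count, offset;
  int64_t n_buffers, n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

static int release_calls = 0; /* bookkeeping for this example only */

/* Hypothetical release callback that just counts how often it runs. */
static void release_counted(struct ArrowArray* array) {
  ++release_calls;
  array->release = NULL;
}

/* Alternative convention: on entry *out must be zeroed or hold a live array. */
static int produce_replacing(struct ArrowArray* out, int64_t length) {
  assert(length >= 0);
  if (out->release != NULL) {
    out->release(out);            /* drop whatever the caller passed in */
  }
  memset(out, 0, sizeof(*out));
  out->length = length;           /* buffers omitted to keep the sketch short */
  out->release = release_counted;
  return 0;
}
```

The catch is that a caller who passes uninitialized memory would make this producer jump through a garbage function pointer, so the caller has to zero the struct before the first call.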

The answer here is probably "up to you", but it might be good to have some
guidance in the spec doc. I suppose since it's the "C interface" it's
probably best to follow the C-style "producer assumes the argument contains
uninitialized memory" convention.

*2) Clarify lifetime semantics for nested structures*

In my application, I'd like to allocate all of the children structures of
an ArrowSchema or ArrowArray out of a memory pool which is stored in the
private_data field of the top-level struct. As such, my initial
implementation was to make the 'release' callback on the top-level struct
delete that memory pool, and set the 'release' callback of all children
structs to null, since their memory was totally owned by the top-level
struct.

I figured this approach was OK because the spec says:

>  Consumers MUST call a base structure's release callback when they won't
be using it anymore, but they MUST not call any of its children's release
callbacks (including the optional dictionary). The producer is responsible
for releasing the children.

That advice seems to indicate that I can do whatever I want with the
release callback of the children, including not setting it.

However, I found that Arrow's ImportArray function would fail a check
because the child structures had no release callbacks set. I had to set the
release callbacks to a no-op function to work around this.
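
Roughly, the layout with the workaround looks like the sketch below. This is not the actual Kudu patch: ChildPool, produce_struct and the callback names are hypothetical, the pool is a single allocation standing in for a real memory pool, and buffers are left out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same member order as struct ArrowArray in the C data interface spec. */
struct ArrowArray {
  int64_t length, null_count, offset;
  int64_t n_buffers, n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical pool: one allocation that holds every child struct. */
struct ChildPool {
  struct ArrowArray* child_ptrs[2];
  struct ArrowArray child_structs[2];
};

static int live_pools = 0; /* bookkeeping for this example only */

/* Child callback: frees nothing, only marks the child as released. */
static void release_child_noop(struct ArrowArray* child) {
  child->release = NULL;
}

/* Parent callback: releases the children, then deletes the pool. */
static void release_parent(struct ArrowArray* parent) {
  for (int64_t i = 0; i < parent->n_children; ++i) {
    struct ArrowArray* child = parent->children[i];
    if (child->release != NULL) child->release(child);
  }
  free(parent->private_data);
  --live_pools;
  parent->release = NULL;
}

static int produce_struct(struct ArrowArray* out, int64_t length) {
  memset(out, 0, sizeof(*out));
  struct ChildPool* pool = calloc(1, sizeof(*pool));
  if (pool == NULL) return -1;
  ++live_pools;
  for (int i = 0; i < 2; ++i) {
    pool->child_structs[i].length = length;
    pool->child_structs[i].release = release_child_noop;
    pool->child_ptrs[i] = &pool->child_structs[i];
  }
  out->length = length;
  out->n_children = 2;
  out->children = pool->child_ptrs;
  out->release = release_parent;
  out->private_data = pool;       /* top-level struct owns the pool */
  return 0;
}
```

The children carry a non-null release callback only so that an importer's "not yet released" check passes; all of the memory still goes away in release_parent.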

This section of the spec also seems to be a bit in conflict with the
following:

> It is possible to move a child array, but the parent array MUST be
released immediately afterwards, as it won't point to a valid child array
anymore. This satisfies the use case of keeping only a subset of child
arrays, while releasing the others.

... because if a parent array owns the memory referred to by the child,
then moving the child (which has a no-op release callback) and releasing
the parent leaves you with an invalid or deallocated child as well.
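
A small sketch of that failure mode, under the same assumptions as my pool-based layout. Pool, produce_parent, move_array and moved_child_outlives_parent are hypothetical names; the move follows the spec's description of a bitwise copy followed by marking the source released. The example only checks whether the pool is still alive and never dereferences the dangling pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same member order as struct ArrowArray in the C data interface spec. */
struct ArrowArray {
  int64_t length, null_count, offset;
  int64_t n_buffers, n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical pool owning the child struct and the child's buffers. */
struct Pool {
  struct ArrowArray* child_ptrs[1];
  struct ArrowArray child;
  const void* child_buffers[2];
  int32_t values[4];
};

static int live_pools = 0; /* bookkeeping for this example only */

static void release_child_noop(struct ArrowArray* c) { c->release = NULL; }

static void release_parent(struct ArrowArray* parent) {
  for (int64_t i = 0; i < parent->n_children; ++i) {
    struct ArrowArray* child = parent->children[i];
    if (child->release != NULL) child->release(child);
  }
  free(parent->private_data);     /* child buffers are freed here too */
  --live_pools;
  parent->release = NULL;
}

static int produce_parent(struct ArrowArray* out) {
  memset(out, 0, sizeof(*out));
  struct Pool* pool = calloc(1, sizeof(*pool));
  if (pool == NULL) return -1;
  ++live_pools;
  pool->child_buffers[1] = pool->values;
  pool->child.length = 4;
  pool->child.n_buffers = 2;
  pool->child.buffers = pool->child_buffers;
  pool->child.release = release_child_noop;
  pool->child_ptrs[0] = &pool->child;
  out->length = 4;
  out->n_children = 1;
  out->children = pool->child_ptrs;
  out->release = release_parent;
  out->private_data = pool;
  return 0;
}

/* Move: bitwise copy, then mark the source as released. */
static void move_array(struct ArrowArray* src, struct ArrowArray* dst) {
  memcpy(dst, src, sizeof(*dst));
  src->release = NULL;
}

/* Returns 1 if the pool behind the moved child survives the parent. */
static int moved_child_outlives_parent(void) {
  struct ArrowArray parent, kept;
  if (produce_parent(&parent) != 0) return -1;
  move_array(parent.children[0], &kept);
  parent.release(&parent);        /* deletes the pool */
  /* kept.release is still non-NULL, but kept.buffers now dangles. */
  return live_pools > 0;
}
```

Here moved_child_outlives_parent() returns 0: the consumer followed the spec's move-then-release sequence and is left holding a child that looks live but points at freed memory.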

In other words, I think the spec should be explicit that either:
(a) every allocated structure should "stand alone" and be individually
releasable (and thus moveable), or
(b) a produced struct must have the same lifetime as all of its children:
consumers should not release children, and if they release the original
base, all children are invalidated regardless of whether they have been
moved.
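
For what it's worth, option (a) doesn't have to rule out pooled allocation. One way a producer could satisfy it, sketched under my own assumptions, is to reference-count the pool so that each struct holds its own reference. SharedPool, pool_unref and the other names are hypothetical, and the counter is not thread-safe.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same member order as struct ArrowArray in the C data interface spec. */
struct ArrowArray {
  int64_t length, null_count, offset;
  int64_t n_buffers, n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical pool shared by the parent and its child. */
struct SharedPool {
  int refcount;                   /* not thread-safe, sketch only */
  struct ArrowArray* child_ptrs[1];
  struct ArrowArray child;
  const void* child_buffers[2];
  int32_t values[4];
};

static int live_pools = 0; /* bookkeeping for this example only */

static void pool_unref(struct SharedPool* pool) {
  if (--pool->refcount == 0) {
    free(pool);
    --live_pools;
  }
}

/* Child callback: drops the child's own reference to the pool. */
static void release_child(struct ArrowArray* child) {
  struct SharedPool* pool = child->private_data;
  child->release = NULL;
  pool_unref(pool);
}

/* Parent callback: releases unmoved children, then its own reference. */
static void release_parent(struct ArrowArray* parent) {
  struct SharedPool* pool = parent->private_data;
  for (int64_t i = 0; i < parent->n_children; ++i) {
    struct ArrowArray* child = parent->children[i];
    if (child->release != NULL) child->release(child);
  }
  parent->release = NULL;
  pool_unref(pool);
}

static int produce_parent(struct ArrowArray* out) {
  memset(out, 0, sizeof(*out));
  struct SharedPool* pool = calloc(1, sizeof(*pool));
  if (pool == NULL) return -1;
  ++live_pools;
  pool->refcount = 2;             /* one for the parent, one for the child */
  for (int i = 0; i < 4; ++i) pool->values[i] = i * 10;
  pool->child_buffers[1] = pool->values;
  pool->child.length = 4;
  pool->child.n_buffers = 2;
  pool->child.buffers = pool->child_buffers;
  pool->child.release = release_child;
  pool->child.private_data = pool;
  pool->child_ptrs[0] = &pool->child;
  out->length = 4;
  out->n_children = 1;
  out->children = pool->child_ptrs;
  out->release = release_parent;
  out->private_data = pool;
  return 0;
}

/* Move: bitwise copy, then mark the source as released. */
static void move_array(struct ArrowArray* src, struct ArrowArray* dst) {
  memcpy(dst, src, sizeof(*dst));
  src->release = NULL;
}
```

With this layout a moved child keeps the pool alive after the parent is released, and the pool is freed only when the last holder calls release. Option (b) would be the earlier no-op-callback layout plus a spec rule that moving children out of such a parent is not allowed.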


Thanks
Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
