Hey folks, I've started working on a patch to make Apache Kudu's C++ client able to expose batches of data in Arrow's new C-style interface ( https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst )
I had a couple questions / items that should be clarified in the spec. Wes suggested I raise them here on dev@: *1) Should producers expect callers to zero-init structs?* The spec suggests that producers have an interface like: Status Produce(ArrowArray* array) { ... } In the case of Arrow's own producer implementation, it doesnt assume that 'array' has been initialized in any way prior to this call, and the first thing it does is zero the memory of 'array'. This is pretty standard behavior in C-style APIs (eg stat(2) doesn't assume that its out-argument is initialized in any way) An alternate approach would be to assume that 'array' is in some valid state, and c all array->release() if it is non-null prior to filling in the array with new data. This is a more C++-style API: in C++ it's rare to have uninitialized structures floating around because constructors usually put objects into some kind of valid state before the object gets passed anywhere. The answer here is probably "up to you", but might be good to have some guidance here in the spec doc. I suppose since it's the "C interface" it's probably best to follow the C-style "producer assumes the argument contains uninitialized memory" convention. *2) Clarify lifetime semantics for nested structures* In my application, i'd like to allocate all of the children structures of an ArrowSchema or ArrowArray out of a memory pool which is stored in the private_data field of the top-level struct. As such, my initial implementation was to make the 'release' callback on the top-level struct delete that memory pool, and set the 'release' callback of all children structs to null, since their memory was totally owned by the top-level struct. I figured this approach was OK because the spec says: > Consumers MUST call a base structure's release callback when they won't be using it anymore, but they MUST not call any of its children's release callbacks (including the optional dictionary). The producer is responsible for releasing the children. That advice seems to indicate that I can do whatever I want with the release callback of the children, including not setting it. However, I found that arrow's ImportArray function would fail a check because the child structures had no release callbacks set. I had to set the release callbacks to a no-op function to work around this. This section of the spec also seems to be a bit in conflict with the following: > It is possible to move a child array, but the parent array MUST be released immediately afterwards, as it won't point to a valid child array anymore. This satisfies the use case of keeping only a subset of child arrays, while releasing the others. ... because if you have a parent array which owns the memory referred to by the child, then moving the child (with a no-op release callback) followed by releasing the parent, you'll end up with an invalid or deallocated child as well. In other words, I think the spec should be explicit that either: (a) every allocated structure should "stand alone" and be individually releasable (and thus moveable) (b) a produced struct must have the same lifetime as all children. Consumers should not release children, and if they release the original base, all children are invalidated regardless of whether they have been moved. Thanks Todd -- Todd Lipcon Software Engineer, Cloudera