An alternative that's worked for us is (ab)using single-child
SparseUnions to represent custom types. We have an enum of "well-known"
typeIds (UUID, vec2s, IP addresses, etc.), whose data is stored in one
of the known Arrow types, as you've done.
The pros are that the typeIds buffer is tiny, and that maintaining type
information doesn't require metadata propagation or string matching.
The cons are that this is really an abuse of the Union type, and since the
typeId buffer is (implicitly?) an Int8, we can have at most 256 extension
types today (an Int8 has 256 distinct values, and fewer are usable if
negative ids are disallowed). We don't have that many yet, but that could
be an issue if this pattern were generalized to any number of custom types.
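
To make the pattern concrete, here is a rough sketch in plain Python. It is only a model of the idea, not Arrow library calls: WellKnownType, its id values, and TaggedColumn are hypothetical names chosen for illustration, and a signed 8-bit `array('b')` stands in for the union's typeIds buffer.

```python
from array import array
from dataclasses import dataclass
from enum import IntEnum


class WellKnownType(IntEnum):
    # Hypothetical ids; the actual enum values are not given in this thread.
    UUID = 0
    VEC2 = 1
    IPV4 = 2


@dataclass
class TaggedColumn:
    """Models a single-child sparse union: one type id per row, one storage child."""

    type_ids: array  # array('b'): signed 8-bit integers, one per row
    storage: list    # stand-in for the child array stored in a known Arrow type

    @classmethod
    def wrap(cls, type_id, values):
        values = list(values)
        # Every row carries the same id, because there is only one child.
        return cls(array("b", [int(type_id)] * len(values)), values)

    def logical_type(self):
        # Recover the custom type from the ids alone: no metadata, no strings.
        ids = set(self.type_ids)
        if len(ids) != 1:
            raise ValueError("expected exactly one type id in a single-child union")
        return WellKnownType(ids.pop())
```

The 8-bit width is what caps the number of custom types: trying to build `array("b", [128])` raises OverflowError, since the value does not fit in a signed byte.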
I'm not sure how widely supported Unions are across the Arrow
implementations or ecosystem (I'm unsure about pandas, and RAPIDS/cuDF has
no support yet), but maybe this pattern could work more generally if we
defined an enum of "well-known" extension typeIds?
Thanks,
Paul
On 2/25/19 3:32 PM, Wes McKinney wrote:
hi folks,
I recently wrote a patch to propose a C++ API for user-defined "extension" types:
https://github.com/apache/arrow/pull/3694
The idea is that an extension type wraps a pre-existing Arrow type.
For example a UUIDType can be represented as FixedSizeBinary(16). The
intent is that Arrow consumers which are not aware of an extension
type can ignore the additional type metadata and still interact with
the raw storage.
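
The UUID example can be illustrated with a small plain-Python sketch. It does not use the Arrow API: a flat bytes buffer with fixed 16-byte slots stands in for FixedSizeBinary(16), and the helper names pack_uuids, raw_slots, and unpack_uuids are hypothetical.

```python
import uuid

WIDTH = 16  # byte width of each slot, as in FixedSizeBinary(16)


def pack_uuids(values):
    """Store UUIDs back to back as 16-byte values (the raw storage)."""
    return b"".join(u.bytes for u in values)


def raw_slots(buf, width=WIDTH):
    """What an unaware consumer sees: opaque fixed-width binary values."""
    if len(buf) % width:
        raise ValueError("buffer length is not a multiple of the slot width")
    return [buf[i:i + width] for i in range(0, len(buf), width)]


def unpack_uuids(buf):
    """What an aware consumer can do: reinterpret each slot as a UUID."""
    return [uuid.UUID(bytes=slot) for slot in raw_slots(buf)]
```

Both consumers read the same bytes; only the aware one attaches the UUID interpretation on top of them.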
One question is how to permit such metadata to be preserved through
IPC / RPC messages (i.e., Schema.fbs) and how other languages can
interact with it. There are a couple of options:
* What I implemented in my patch: use the Field-level custom_metadata
field with known key names "arrow_extension_name" and
"arrow_extension_data" for the type's unique identifier and serialized
form, respectively. If we opt for this, then we should add a section
to the specification to codify the convention used.
* Add a new field to the Field table in Schema.fbs.
The former is attractive in the sense that consumers who don't have
special handling for an extension type will carry along the Field
metadata in their schema, so it can be passed on in subsequent IPC
messages without writing any extra code.
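
A minimal plain-Python model of the first option is sketched below. The two key names come from the proposal above; everything else (the Field dataclass, tag_extension, rename_unaware, extension_name, and the empty serialized payload) is a hypothetical stand-in, not the Arrow API.

```python
from dataclasses import dataclass, field

EXT_NAME_KEY = "arrow_extension_name"  # unique identifier of the extension type
EXT_DATA_KEY = "arrow_extension_data"  # serialized form of the extension type


@dataclass(frozen=True)
class Field:
    # Simplified stand-in for a schema field with key/value metadata.
    name: str
    storage_type: str
    custom_metadata: dict = field(default_factory=dict)


def tag_extension(f, ext_name, serialized):
    """Producer side: attach the extension identity to the field metadata."""
    meta = dict(f.custom_metadata)
    meta[EXT_NAME_KEY] = ext_name
    meta[EXT_DATA_KEY] = serialized
    return Field(f.name, f.storage_type, meta)


def rename_unaware(f, new_name):
    """A consumer with no extension handling: metadata is copied untouched."""
    return Field(new_name, f.storage_type, dict(f.custom_metadata))


def extension_name(f):
    """An aware consumer downstream can still recover the extension name."""
    return f.custom_metadata.get(EXT_NAME_KEY)
```

For example, tagging a `fixed_size_binary[16]` field as "uuid" and passing it through rename_unaware leaves both keys in place, so extension_name still returns "uuid" on the other side.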
Thoughts about this? With a C++ implementation landing, it would be
great to identify a champion to create a Java implementation and also
add integration test support to ensure that consumers do not destroy
the extension type metadata for unrecognized types (i.e., if I send you
data that says it's "uuid" and you don't know what that is yet, you
preserve the metadata fields anyway).
Thanks
Wes