An alternative that's worked for us is (ab)using single-child SparseUnions to represent custom types. We have an enum of "well-known" typeIds (UUID, vec2's, IP addresses, etc), whose data is stored in one of the known Arrow types, as you've done.

The pros are that the typeIds buffer is tiny, and that the approach doesn't require metadata propagation or string matching to maintain type information.

The cons are that this is really an abuse of the Union type, and since the typeId buffer is (implicitly?) a signed Int8, we can only have 128 extension types today (ids 0 through 127, assuming negative ids aren't allowed). We don't have that many yet, but that could be an issue if this pattern were generalized to any number of custom types.
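To make the pattern concrete, here is a minimal pure-Python sketch of a single-child sparse union used as a type tag. It does not use any Arrow library; the `WellKnownType` enum, its id values, and both helper functions are hypothetical names made up for illustration, not our actual code.

```python
from array import array
from enum import IntEnum
import uuid


class WellKnownType(IntEnum):
    # Hypothetical ids for illustration; any agreed-upon enum would do.
    UUID = 0
    VEC2 = 1
    IPV4 = 2


def make_single_child_union(type_id, storage):
    """Model a sparse union with exactly one child.

    The union type declares a single type id, and the type-ids buffer
    repeats that id once per row (one signed byte each).
    """
    type_ids = array("b", [int(type_id)] * len(storage))
    return {
        "declared_type_ids": (int(type_id),),  # lives in the schema
        "type_ids": type_ids,                  # lives in the data
        "children": [storage],                 # the one storage column
    }


def logical_type(column):
    """Recover the custom type from the union's declared id alone."""
    (only_id,) = column["declared_type_ids"]
    return WellKnownType(only_id)


# UUIDs stored as 16-byte values, as with FixedSizeBinary(16).
uuids = [uuid.UUID(int=i).bytes for i in range(3)]
column = make_single_child_union(WellKnownType.UUID, uuids)
```

No string comparison or metadata lookup is needed here: `logical_type(column)` maps the declared id straight back to the enum member, and the per-row overhead is `column["type_ids"].itemsize` bytes.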

I'm not sure how widely Unions are supported across the Arrow implementations and ecosystem (I'm unsure about pandas, and RAPIDS/cuDF has no support yet), but maybe this pattern could work more generally if we defined an enum of "well-known" extension typeIds?

Thanks,

Paul


On 2/25/19 3:32 PM, Wes McKinney wrote:
hi folks,

I recently wrote a patch to propose a C++ API for user-defined "extension" types

https://github.com/apache/arrow/pull/3694

The idea is that an extension type wraps a pre-existing Arrow type.
For example a UUIDType can be represented as FixedSizeBinary(16). The
intent is that Arrow consumers which are not aware of an extension
type can ignore the additional type metadata and still interact with
the raw storage.
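
As a rough illustration of the storage-wrapping idea (not the C++ API from the patch), the pure-Python sketch below keeps UUIDs as plain 16-byte values. The two reader functions are hypothetical stand-ins for a consumer that ignores the extension type and one that understands it.

```python
import uuid

# Storage is just 16-byte values, i.e. what FixedSizeBinary(16) would hold.
STORAGE_BYTE_WIDTH = 16
storage = [uuid.UUID(int=1).bytes, uuid.UUID(int=2).bytes]


def read_unaware(values):
    """A consumer that knows nothing about UUIDs sees raw bytes."""
    return list(values)


def read_aware(values):
    """A consumer that recognizes the extension type decodes the bytes."""
    decoded = []
    for raw in values:
        if len(raw) != STORAGE_BYTE_WIDTH:
            raise ValueError("expected a 16-byte storage value")
        decoded.append(uuid.UUID(bytes=raw))
    return decoded
```

Both readers work on the same storage; only the interpretation differs.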

One question is how to permit such metadata to be preserved through
IPC / RPC messages (i.e., Schema.fbs) and how other languages can
interact with it. There are a couple of options:

* What I implemented in my patch: use the Field-level custom_metadata
field with known key names "arrow_extension_name" and
"arrow_extension_data" for the type's unique identifier and serialized
form, respectively. If we opt for this, then we should add a section
to the specification to codify the convention used

* Add a new field to the Field table in Schema.fbs

The former is attractive in the sense that consumers who don't have
special handling for an extension type will carry along the Field
metadata in their schema, so it can be passed on in subsequent IPC
messages without writing any extra code.
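
A small sketch of the first option, in plain Python rather than any Arrow implementation: the two key names come from the proposal above, while the `Field` class, the `KNOWN_EXTENSIONS` registry, the storage type string, and the empty serialized payload are assumptions made up for illustration.

```python
from dataclasses import dataclass, field

EXT_NAME_KEY = "arrow_extension_name"  # the type's unique identifier
EXT_DATA_KEY = "arrow_extension_data"  # the type's serialized form


@dataclass
class Field:
    # Hypothetical stand-in for a schema field.
    name: str
    storage_type: str
    custom_metadata: dict = field(default_factory=dict)


# Hypothetical registry of extension names this consumer understands.
KNOWN_EXTENSIONS = {"uuid"}


def logical_type_of(f, known=KNOWN_EXTENSIONS):
    """Return the extension name if recognized, else the storage type."""
    ext_name = f.custom_metadata.get(EXT_NAME_KEY)
    return ext_name if ext_name in known else f.storage_type


def forward(f):
    """Re-emit a field, copying custom_metadata through unchanged."""
    return Field(f.name, f.storage_type, dict(f.custom_metadata))


sent = Field(
    "id",
    "fixed_size_binary[16]",
    {EXT_NAME_KEY: "uuid", EXT_DATA_KEY: ""},  # empty payload assumed
)
relayed = forward(sent)
```

A consumer with an empty registry falls back to the storage type, yet `relayed.custom_metadata` still carries both keys, so a later consumer that does know "uuid" can recover the extension type.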

Thoughts about this? With a C++ implementation landing, it would be
great to identify a champion to create a Java implementation and also
add integration test support to ensure that consumers do not destroy
the extension type metadata for unrecognized types (i.e. if I send you
data that says it's "uuid" and you don't know what that is yet, you
preserve the metadata fields anyway).

Thanks
Wes
