Hello all,

The Arrow format has support for extension types, but there's no official way to agree across implementations on well-known extension types.

This issue has come up a couple times with people wanting to implement support for types such as JSON or UUID in order to enable better interoperability with third-party systems such as Parquet or databases.

I think it's time to discuss and decide how we should progressively standardize some well-known, "canonical", extension types.


I would tentatively propose the following rules:

* Canonical extension types are described in a separate document under the format specifications directory: https://github.com/apache/arrow/tree/master/docs/source/format (note this gets turned into HTML docs by Sphinx => https://arrow.apache.org/docs/index.html)

* Each canonical extension type requires a separate discussion and vote on the mailing-list

* The specification text to be added *must* follow these requirements

1) It *must* have a well-defined name starting with "ARROW:"
2) Its parameters, if any, *must* be described in the proposal
3) Its serialization *must* be described in the proposal and should not require undue work or unusual software dependencies (for example, a trivial custom text format or JSON would be acceptable)
4) Its expected semantics *should* be described as well, and any potential ambiguities or pain points addressed or at least mentioned

* The extension type *should* have one implementation submitted; preferably two if non-trivial (for example if parameterized)
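
To make the naming and serialization requirements above a bit more concrete, here is a minimal sketch in plain Python (standard library only). It is not an Arrow API: the class `CanonicalExtensionType`, its fields, and the type names "ARROW:example_uuid" and "ARROW:example_tensor" are all hypothetical, and JSON is assumed as the parameter serialization simply because it is one of the acceptable options mentioned in rule 3.

```python
import json
from dataclasses import dataclass, field

# Prefix proposed in rule 1 above.
CANONICAL_PREFIX = "ARROW:"


@dataclass(frozen=True)
class CanonicalExtensionType:
    """Hypothetical descriptor for a canonical extension type (illustration only)."""

    name: str                # rule 1: well-defined name starting with "ARROW:"
    storage_type: str        # textual description of the underlying Arrow type
    parameters: dict = field(default_factory=dict)  # rule 2: documented parameters

    def __post_init__(self):
        if not self.name.startswith(CANONICAL_PREFIX):
            raise ValueError(
                f"canonical extension type names must start with {CANONICAL_PREFIX!r}"
            )

    def serialize(self) -> bytes:
        # Rule 3: a simple serialization with no unusual dependencies.
        # Sorting keys makes the output deterministic across implementations.
        return json.dumps(self.parameters, sort_keys=True).encode("utf-8")

    @classmethod
    def deserialize(cls, name: str, storage_type: str, data: bytes):
        return cls(name, storage_type, json.loads(data.decode("utf-8")))


# A non-parameterized type and a parameterized one (both names are made up).
uuid_type = CanonicalExtensionType("ARROW:example_uuid", "fixed_size_binary(16)")
tensor_type = CanonicalExtensionType(
    "ARROW:example_tensor", "fixed_size_list(float32, 6)", {"shape": [2, 3]}
)

# Round-trip the parameters through the serialized form.
restored = CanonicalExtensionType.deserialize(
    tensor_type.name, tensor_type.storage_type, tensor_type.serialize()
)
assert restored == tensor_type
```

A parameterized type such as the second one is the kind of case where a second independent implementation would help check that the serialized form is unambiguous.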


Feel free to comment.

Regards

Antoine.
