Hello all,
The Arrow format has support for extension types, but there's no
official way to agree across implementations on well-known extension types.
This issue has come up a couple of times with people wanting to implement
support for types such as JSON or UUID in order to enable better
interoperability with third-party systems such as Parquet or databases.
I think it's time to discuss and decide how we should progressively
standardize some well-known, "canonical", extension types.
I would tentatively propose the following rules:
* Canonical extension types are described in a separate document under
the format specifications directory:
https://github.com/apache/arrow/tree/master/docs/source/format (note
this gets turned into HTML docs by Sphinx =>
https://arrow.apache.org/docs/index.html)
* Each canonical extension type requires a separate discussion and vote
on the mailing-list
* The specification text to be added *must* follow these requirements:
1) It *must* have a well-defined name starting with "ARROW:"
2) Its parameters, if any, *must* be described in the proposal
3) Its serialization *must* be described in the proposal and should not
require undue work or unusual software dependencies (for example, a
trivial custom text format or JSON would be acceptable)
4) Its expected semantics *should* be described as well and any
potential ambiguities or pain points addressed or at least mentioned
* The extension type *should* have one implementation submitted;
preferably two if non-trivial (for example if parameterized)
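To make requirements 1) to 3) more concrete, here is a minimal sketch in
plain Python of what a parameterized canonical type could put in the field
metadata. The keys "ARROW:extension:name" and "ARROW:extension:metadata"
are the ones the format already uses for extension types. Everything else
is an assumption for illustration only: the type name
"ARROW:example_tensor", its "shape" parameter, the helper names, and the
choice of compact JSON with sorted keys as the serialization.

```python
import json

# Field-metadata keys the Arrow format already uses for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_METADATA_KEY = "ARROW:extension:metadata"

# Prefix proposed in requirement 1) for canonical extension type names.
CANONICAL_PREFIX = "ARROW:"


def serialize_params(params):
    """Serialize parameters as compact JSON with sorted keys.

    Assumption: JSON is only one acceptable choice under requirement 3).
    Sorting keys makes the output identical across implementations.
    """
    return json.dumps(params, sort_keys=True, separators=(",", ":"))


def field_metadata(name, params):
    """Build the field metadata for a (hypothetical) canonical type."""
    if not name.startswith(CANONICAL_PREFIX):
        raise ValueError(f"canonical name must start with {CANONICAL_PREFIX!r}")
    return {EXT_NAME_KEY: name, EXT_METADATA_KEY: serialize_params(params)}


def parse_field_metadata(metadata):
    """Recover (name, params) from field metadata produced by any implementation."""
    name = metadata[EXT_NAME_KEY]
    if not name.startswith(CANONICAL_PREFIX):
        raise ValueError(f"not a canonical extension type: {name!r}")
    return name, json.loads(metadata[EXT_METADATA_KEY])


# Hypothetical parameterized type: the name and the "shape" parameter are
# made up for this example and are not part of any specification.
meta = field_metadata("ARROW:example_tensor", {"shape": [2, 3]})
name, params = parse_field_metadata(meta)
```

The point of the round trip is that a second implementation only needs the
specification text (name, parameters, serialization) to read what the
first one wrote, without sharing any code.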
Feel free to comment.
Regards
Antoine.