As we add more capability to the C++ Substrait consumer, we are
starting to find spots where extensions to the Substrait
specification are needed.  I'm wondering to what degree these
extensions are part of the Arrow project and to what degree they
are part of a specific implementation.  I would appreciate any
guidance or opinions.

I'd like to give a few specific examples of these extensions, as I
think the way we handle each one may depend on the nature of the
extension.

1. Arrow-specific and applicable to all implementations

The Substrait spec has its own type system[1] which does not include
some Arrow types (e.g. unsigned integers).  An "extension" in this
case is mostly a (URI namespaced) name that producers and consumers
can agree on (e.g.
https://arrow.apache.org/substrait/v1/types.yaml#uint8).  In the
future there is the potential for some additional metadata to
accompany each type (e.g. a way to express that types are variations
of existing types) but this hasn't yet been well defined.
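
To make the "URI namespaced name" idea concrete, here is a rough
sketch of how a consumer might resolve such a reference.  The registry
contents and the function name are hypothetical and only for
illustration; this is not how any existing consumer is implemented.

```python
from urllib.parse import urldefrag

# The example URI from above; the type names listed under it are assumed
# purely for illustration.
ARROW_TYPES_URI = "https://arrow.apache.org/substrait/v1/types.yaml"

# Hypothetical registry: extension URI -> set of type names the consumer knows.
KNOWN_EXTENSION_TYPES = {
    ARROW_TYPES_URI: {"uint8", "uint16", "uint32", "uint64"},
}


def resolve_extension_type(reference):
    """Split 'URI#name' and check that the consumer recognizes the pair."""
    uri, name = urldefrag(reference)
    if name not in KNOWN_EXTENSION_TYPES.get(uri, set()):
        raise KeyError(f"unknown extension type: {reference}")
    return uri, name
```

A producer and consumer only need to agree on the URI and the name; a
reference with an unrecognized URI or name is rejected.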

I think this extension, though rather simple, will be of interest to
all users of Arrow, as well as developers of Arrow implementations
(e.g. consumers), and so the impact is pretty far-reaching.  However,
given the relative simplicity, I don't know that we need to do much
beyond GitHub PRs (e.g. we don't need two implementations to adopt
this).

At the moment there is a version at [2] which I will propose be the
official implementation for the Apache Arrow project (although it
needs a tiny bit of cleanup to remove a comment reference to C++).
Assuming the discussion doesn't raise any significant concerns in the
next week or so, I'll propose a vote to adopt this.

Other things could fall into this category.  For example, we may need
a file format extension for Arrow IPC files (even if [3] merges we
still would want to extend that once Substrait supports writes).  We
may also want to define sink and source relations for the Arrow C
stream interface.  For anything in this category I think we should
have a single Arrow-supported extension and vote on acceptance of the
initial implementation (as well as on criteria for making updates).

2. Non-Arrow specific features with wide support across implementations

An example here is a CSV file format extension.  CSV is an interesting
format as it is not very self-describing and will need a rather
extensive proto message (or messages) to describe how to read and
write files.  Several implementations support reading and writing
CSVs[4] and it would seem prudent that we agree on a common
definition.  However, CSV is not something Arrow has any ownership
over.  This raises a few questions:

 * Would we use "arrow" in the extension name (protobuf extensions, as
opposed to YAML extensions, don't really have a URI but they do have a
"package name")?
 * Should we vote on an "official" standard to use across
implementations or let each implementation choose their own?
 * Could it live within an Arrow repository or would it always live
outside the Arrow repos?
 * If it lived outside the Arrow repos, would we include a pointer
within the Arrow repository to the voted-on standard (assuming we vote
on a standard)?

3. Implementation-specific features

A major extension category in Substrait is extension functions.
However, these are likely to vary between implementations.  It is
possible some implementations may agree on descriptions for a common
collection of functions (e.g. GeoJSON) and then these could follow the
procedures in category 2.

In general, I think extension functions are likely to be specific to
individual implementations.  There wouldn't need to be any vote on
these and, in some cases, the YAML may be automatically generated
(e.g. in the C++ implementation we would probably like to
automatically generate the YAML from our function registry).
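
As a rough sketch of what "generate the YAML from the function
registry" could look like: the registry below is a hypothetical,
heavily simplified stand-in (not the actual C++ function registry),
and the emitted layout only approximates the shape of a Substrait
function extension file.

```python
# Hypothetical registry: function name -> list of (argument types, return type).
FUNCTION_REGISTRY = {
    "add": [(("i32", "i32"), "i32"), (("fp64", "fp64"), "fp64")],
    "is_null": [(("any",), "boolean")],
}


def registry_to_yaml(registry):
    """Emit one scalar function entry per name, one impl per signature."""
    lines = ["scalar_functions:"]
    for name in sorted(registry):
        lines.append(f'  - name: "{name}"')
        lines.append("    impls:")
        for arg_types, return_type in registry[name]:
            lines.append("      - args:")
            for arg_type in arg_types:
                lines.append(f"          - value: {arg_type}")
            lines.append(f"        return: {return_type}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(registry_to_yaml(FUNCTION_REGISTRY), end="")
```

Since the file is derived mechanically from the registry, there would
be nothing for anyone to vote on; the registry stays the source of
truth.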

In addition to extension functions, I think it likely that there will
also be some relation extensions that are specific to a given
implementation.  The YAML and proto files for these extensions could
live in the implementation's code base.

 * Should we support "arrow hosted" names for these extensions (e.g.
https://arrow.apache.org/substrait/cpp/v1/function_types.yaml)?

[1] https://substrait.io/types/simple_logical_types/
[2] https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml
[3] https://github.com/substrait-io/substrait/pull/169
[4] https://arrow.apache.org/docs/status.html#third-party-data-formats