As we add more capability to the C++ Substrait consumer, we are starting to find spots where extensions to the Substrait specification are needed. I'm wondering to what degree these extensions are part of the Arrow project and to what degree they are part of a specific implementation. I would appreciate any guidance or opinions.

I'd like to give a few specific examples of these extensions, as I think the way we handle each one may depend on its nature.

1. Arrow-specific and applicable to all implementations

The Substrait spec has its own type system[1], which does not include some Arrow types (e.g. unsigned integers). An "extension" in this case is mostly a (URI namespaced) name that producers and consumers can agree on (e.g. https://arrow.apache.org/substrait/v1/types.yaml#uint8). In the future there is the potential for some additional metadata to accompany each type (e.g. a way to express that types are variations of existing types) but this hasn't yet been well defined.

I think this extension, though rather simple, will be of interest to all users of Arrow, as well as developers of Arrow implementations (e.g. consumers), and so the impact is pretty far-reaching. However, given the relative simplicity, I don't know that we need to do much beyond GitHub PRs (e.g. we don't need two implementations to adopt this, etc.)

At the moment there is a version at [2], which I will propose as the official implementation for the Apache Arrow project (although it needs a tiny bit of cleanup to remove a comment reference to C++). Assuming the discussion doesn't raise any significant concerns in the next week or so, I'll propose a vote to adopt this.

Other things could fall into this category. For example, we may need a file format extension for Arrow IPC files (even if [3] merges we would still want to extend that once Substrait supports writes). We may also want to define sink and source relations for the Arrow C stream interface. For anything in this category I think we should have a single Arrow-supported extension and vote on acceptance of the initial implementation (as well as on criteria for making updates).

2. Non-Arrow specific features with wide support across implementations

An example here is a CSV file format extension.
CSV is an interesting format as it is not very self-describing and will need a rather extensive proto message (or messages) to describe how to read and write files. Several implementations support reading and writing CSVs[4] and it would seem prudent that we agree on a common definition. However, CSV is not something Arrow has any ownership over. This raises a few questions:

* Would we use "arrow" in the extension name (protobuf extensions, as opposed to YAML extensions, don't really have a URI but they do have a "package name")?
* Should we vote on an "official" standard to use across implementations or let each implementation choose its own?
* Could it live within an Arrow repository or would it always live outside the Arrow repos?
* If it lived outside the Arrow repos, would we include a pointer within the Arrow repository to the voted-on standard (assuming we vote on a standard)?

3. Implementation-specific features

A major extension category in Substrait is extension functions. However, these are likely to vary between implementations. It is possible some implementations may agree on descriptions for a common collection of functions (e.g. geoJSON), and then these could follow the procedures in 2. In general, though, I think extension functions are likely to be specific to individual implementations. There wouldn't need to be any vote on these and, in some cases, the YAML may be automatically generated (e.g. in the C++ implementation we would probably like to automatically generate the YAML from our function registry).

In addition to extension functions, I think there will likely also be some relation extensions that are specific to a given implementation. The YAML and proto files for these extensions could live in the implementation's code base.

* Should we support "arrow hosted" names for these extensions (e.g. https://arrow.apache.org/substrait/cpp/v1/function_types.yaml)?
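
To make the category 1 type names a bit more concrete, here is a rough Python sketch of how a consumer might split a URI-namespaced name such as https://arrow.apache.org/substrait/v1/types.yaml#uint8 and look it up. This is only an illustration and is not taken from any implementation: the function names, the registry, and the "arrow:uint8" label are all hypothetical, and only the "uint8" name comes from the example above.

```python
from urllib.parse import urldefrag

# Namespace taken from the example name in the text above.
ARROW_TYPES_URI = "https://arrow.apache.org/substrait/v1/types.yaml"

# Hypothetical registry: (namespace URI, type name) -> consumer-side label.
# Only "uint8" appears in the text; a real consumer would populate this
# from whatever YAML file producers and consumers agree on.
KNOWN_EXTENSION_TYPES = {
    (ARROW_TYPES_URI, "uint8"): "arrow:uint8",
}


def split_extension_name(qualified_name):
    """Split '<uri>#<name>' into (namespace URI, type name)."""
    uri, name = urldefrag(qualified_name)
    if not uri or not name:
        raise ValueError(f"expected '<uri>#<name>', got {qualified_name!r}")
    return uri, name


def resolve_extension_type(qualified_name):
    """Return the consumer-side label, or None if the name is not recognized."""
    return KNOWN_EXTENSION_TYPES.get(split_extension_name(qualified_name))
```

The point of keying the lookup on both the namespace and the name is that two implementations could each define a "uint8" under different URIs without colliding, which is what makes the "who hosts the name" question in categories 2 and 3 matter.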

[1] https://substrait.io/types/simple_logical_types/
[2] https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml
[3] https://github.com/substrait-io/substrait/pull/169
[4] https://arrow.apache.org/docs/status.html#third-party-data-formats