> At the moment there is a version at [2] which I will propose be the
> official implementation for the Apache Arrow project (although it
> needs a tiny bit of cleanup to remove a comment reference to C++).
> Assuming the discussion doesn't raise any significant concerns in the
> next week or so I'll propose a vote to adopt this.

AFAICT, Substrait currently lacks a specified method for referring from
one YAML extension to another. That is, types defined in one YAML file
cannot currently be used by functions defined in another. This is a
rather obvious deficiency that I believe will be solved at one point or
another, but until that time, I would propose to rename the file to
something like "extensions.yaml" rather than "types.yaml". That way, if
we do end up needing to define functions in the same file, they won't
stand out as much.

> 2. Non-Arrow specific features with wide support across
> implementations

I would argue that anything that needs at least a de facto standard due
to widespread adoption should be added to the actual standard, i.e.
added to Substrait itself. CSV especially seems like a no-brainer for
this due to its ubiquity. Until something is added to the Substrait
specification, I would argue that the extensions should be
project-specific (or specific to a small subset of projects); going
through a voting process for extensions outside the scope of just Arrow
seems to me like just adding something directly to the Substrait
specification with extra steps.

Failing that, however, I would say that if Arrow handles the voting and
adoption for a particular extension, however generic, it should be
namespaced and hosted by Arrow. Perhaps in the same way as the YAML
file, so other projects using the extensions don't need to pull in all
of Arrow just to get to the proto file? For example:

https://arrow.apache.org/substrait/v1/extensions.yaml
https://arrow.apache.org/substrait/v1/extensions.proto

> Would we use "arrow" in the extension name (protobuf extensions, as
> opposed to YAML extensions, don't really have a URI but they do have
> a "package name")?

The type URLs are simply the fully-qualified protobuf message types, and
you can nest namespaces as deeply as you like.
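
To make the namespacing point concrete, here is a minimal sketch of how
the package name ends up in a type URL. When a message is packed into a
google.protobuf.Any, the default type URL is "type.googleapis.com/"
followed by the fully-qualified message name. The sketch only builds the
strings and does not use the protobuf library; the package names and the
CSVFile message are hypothetical placeholders, not agreed-upon names.

```python
# Hypothetical package names, for illustration only. Neither is an
# agreed-upon name in Arrow or Substrait.
SHORT_PACKAGE = "arrow"
LONG_PACKAGE = "arrow.substrait.v1.extensions"

# Default prefix that protobuf uses when packing a message into an Any.
DEFAULT_PREFIX = "type.googleapis.com"


def type_url(package: str, message: str, prefix: str = DEFAULT_PREFIX) -> str:
    """Build an Any type URL from a protobuf package and message name."""
    return f"{prefix}/{package}.{message}"


def full_name(url: str) -> str:
    """Recover the fully-qualified message name (text after the last '/')."""
    return url.rsplit("/", 1)[-1]


short_url = type_url(SHORT_PACKAGE, "CSVFile")
long_url = type_url(LONG_PACKAGE, "CSVFile")

print(short_url)  # type.googleapis.com/arrow.CSVFile
print(long_url)   # type.googleapis.com/arrow.substrait.v1.extensions.CSVFile
```

Both names are valid; the longer one simply leaves less room for a clash
with some other message that might also be called "arrow.CSVFile".
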
I don't have a strong opinion as to what format should be used (anything
from "arrow.Something" to
"org.apache.arrow.substrait.v1.extensions.foo.bar.Something" would do),
but it should be sufficiently namespaced so it won't conflict with
anything we or anyone else does now or in the foreseeable future.
"arrow" is probably unique enough as a top-level namespace (we use the
same in C++, after all), but adding something like
"substrait.v1.extensions" seems like a good idea to me. "arrow.CSVFile"
could mean a lot more things than a Substrait extension for describing
the format of a CSV file, after all.

> It is possible some implementations may agree on descriptions for a
> common collection of functions (e.g. geoJSON) and then these could
> follow the procedures in 2.

This seems to be what [1] is for, though it's a bit of a mish-mash of
things right now.

P.S. This is my first post to the ML, so, hi all! :) I've been working
on a generic validator for Substrait plans for a while now [2], and
helped with the initial implementation of the Arrow Substrait consumer.

[1] https://github.com/substrait-io/substrait/tree/main/extensions
[2] https://github.com/substrait-io/substrait/pull/155

On Tue, 19 Apr 2022 at 01:52, Weston Pace <weston.p...@gmail.com> wrote:

> As we are starting to add more capability to the C++ Substrait
> consumer we are starting to look at spots where extensions are needed
> for the Substrait specification. I'm wondering to what degree these
> extensions are a part of the Arrow project and to what degree these
> are part of a specific implementation. I would appreciate any
> guidance or opinions.
>
> I'd like to give a few specific examples of these extensions as I
> think the way we handle it may depend on the nature of the specific
> extension.
>
> 1. Arrow-specific and applicable to all implementations
>
> The Substrait spec has its own type system[1] which does not include
> some Arrow types (e.g. unsigned integers).
> An "extension" in this case is mostly a (URI namespaced) name that
> producers and consumers can agree on (e.g.
> https://arrow.apache.org/substrait/v1/types.yaml#uint8). In the
> future there is the potential for some additional metadata to
> accompany each type (e.g. a way to express types are variations of
> existing types) but this hasn't yet been well defined.
>
> I think this extension, though rather simple, will be of interest to
> all users of Arrow, as well as developers of Arrow implementations
> (e.g. consumers), and so the impact is pretty far-reaching. However,
> given the relative simplicity, I don't know that we need to do much
> beyond Github PRs (e.g. we don't need two implementations to adopt
> this, etc.)
>
> At the moment there is a version at [2] which I will propose be the
> official implementation for the Apache Arrow project (although it
> needs a tiny bit of cleanup to remove a comment reference to C++).
> Assuming the discussion doesn't raise any significant concerns in the
> next week or so I'll propose a vote to adopt this.
>
> Other things could fall into this category. For example, we may need
> a file format extension for Arrow IPC files (even if [3] merges we
> still would want to extend that once Substrait supports writes). We
> may also want to define sink and source relations for the Arrow C
> stream interface. For anything in this category I think we should
> have a single Arrow supported extension and vote on acceptance of the
> initial implementation (as well as a criteria for making updates).
>
> 2. Non-Arrow specific features with wide support across implementations
>
> An example here is a CSV file format extension. CSV is an interesting
> format as it is not very self-describing and will need a rather
> extensive proto message (or messages) to describe how to read and
> write files. Several implementations support reading and writing
> CSVs[4] and it would seem prudent that we agree on a common
> definition.
> However, CSV is not something Arrow has any ownership over. This
> raises a few questions:
>
> * Would we use "arrow" in the extension name (protobuf extensions, as
> opposed to YAML extensions, don't really have a URI but they do have a
> "package name")?
> * Should we vote on an "official" standard to use across
> implementations or let each implementation choose their own?
> * Could it live within an Arrow repository or would it always live
> outside the Arrow repos?
> * If it lived outside the Arrow repos would we include a pointer
> within the Arrow repository to the voted on standard (assuming we vote
> on a standard)?
>
> 3. Implementation-specific features
>
> A major extension category in Substrait is extension functions.
> However, these are likely to vary between implementations. It is
> possible some implementations may agree on descriptions for a common
> collection of functions (e.g. geoJSON) and then these could follow the
> procedures in 2.
>
> In general, I think extension functions are likely to be specific to
> individual implementations. There wouldn't need to be any vote on
> these and, in some cases, the YAML may be automatically generated
> (e.g. in the C++ implementation we would probably like to
> automatically generate the YAML from our function registry).
>
> In addition to extension functions I think it likely that there will
> probably also be some examples of relation extensions that are
> specific to a given implementation. The YAML and proto files for
> these extensions could live in the implementation's code base.
>
> * Should we support "arrow hosted" names for these extensions (e.g.
> https://arrow.apache.org/substrait/cpp/v1/function_types.yaml)?
>
> [1] https://substrait.io/types/simple_logical_types/
> [2] https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml
> [3] https://github.com/substrait-io/substrait/pull/169
> [4] https://arrow.apache.org/docs/status.html#third-party-data-formats