Sorry for the many typos :( > That's why I think computing schema during serialization might be a better solution than going for index based in Arrow.
To clarify, by this I meant keeping the substrait representation and in Arrow consumer, compute/construct schema/column names from subtrait field indices. On Wed, Apr 20, 2022 at 11:25 AM Li Jin <ice.xell...@gmail.com> wrote: > > I don't think Substratit cannot convince Spark to go index based. > Sorry I meant " I don't think Substrait can convince Spark to go index > based" > > On Wed, Apr 20, 2022 at 11:24 AM Li Jin <ice.xell...@gmail.com> wrote: > >> > A third alternative might be to adjust Arrow to move to field-index >> based >> execution instead of storing schemas in its relational operators. >> >> This is an interesting idea - I do think there are operators that need >> column name information (e.g. a "checkpoint" node that writes to external >> storage). In this case, it sounds like any operator that requires knowing >> the column name needs to store additional information for the column names >> in addition to the field index, in its substrait representation. >> >> > The names at the relation root are a different story, since they >> specify the column names for the produced table, but there's already a >> mechanism for specifying those. >> >> I feel this is a bit limited because I don't think relation roots are the >> only thing that produces tables. I can think of two examples in Spark that >> produce tables and are not a relation root. (1) A checkpoint node that >> produces a table to write to external storage (2) A Python UDF node that >> materializes a slice of the table and produces a pd.DataFrame to pass down >> to a user function. I think we will see these cases in Arrow as well. I >> feel it might make sense to have a subclass of the operator that requires >> column names in addition to just root rel. >> >> *On schema-based vs index-based**:* >> >> From personal experience working on Spark it would say that having schema >> available in each operator is useful for developers - I often print >> input/output rows and I feel not having the column names available might >> put a tax on the developer. (But maybe there are other ways to solve this). >> >> Are there other substrait consumers at the moment and how do they deal >> with this? For example, if we were to have Spark consumes substrait I'd >> imagine there would be similar problems because schema is part of Spark >> operators too and I don't think Substratit cannot convince Spark to go >> index based. >> I'd imagine during deserialization of the subtrait plan to Spark it would >> need to compute a schema object for each operator. That's why I think >> computing schema during serialization might be a better solution than going >> for index based in Arrow. >> >> >> On Wed, Apr 20, 2022 at 11:14 AM Phillip Cloud <cpcl...@gmail.com> wrote: >> >>> On Wed, Apr 20, 2022, 11:31 Jeroen van Straten < >>> jeroen.van.stra...@gmail.com> >>> wrote: >>> >>> > From a Substrait perspective, it would be up to Ibis to convert the >>> > column names to the correct indices. This can and should be transparent >>> > to the end user; front-ends already need to know what the schemas are >>> > in order to produce a valid Substrait plan, so Ibis should have all the >>> > information it needs to do this conversion. >>> > >>> > I would, however, agree that it's inconsistent for Substrait to require >>> > that column names are specified for read relations, but not for the >>> > other relations. AFAICT nothing on the consumer side can or should make >>> > use of those names. The only use case I can think of is to aid >>> > debugging; if that's the only reason they exist, it should indeed be >>> > possible to specify the names for each relation, but it should be in >>> > the form of as optional, non-functional hints in RelCommon. The names >>> > at the relation root are a different story, since they specify the >>> > column names for the produced table, but there's already a mechanism >>> > for specifying those. No idea if Arrow and Ibis use it (since it's >>> > optional), but Substrait supports it. >>> > >>> >>> ReadRels having column names is probably not strictly necessary for all >>> the >>> ReadRels, but for NamedTable it's necessary. For example, if I wanted to >>> represent a ReadRel independent of storage, that an API could deserialize >>> for metadata, and present that metadata to the user then you need column >>> names. Ibis uses the column names from ReadRels to turn Substrait back >>> into >>> expressions. >>> >>> The RelRoot is very much used in ibis. Every table expression compiles >>> to a >>> RelRoot. >>> >>> >>> > If there *are* special cases I'm not seeing, where the execution engine >>> > really does need the names of intermediate columns, I suppose it would >>> > be up to the engine to figure out what they would be. The information >>> > could also be added via Substrait "advanced extensions," especially >>> > when a user overrides a name, but it wouldn't be a good idea for the >>> > consumer to rely on the existance of these extensions, for >>> > compatibility with other frameworks. >>> >>> >>> > If said special cases turn out to not be so special, a valid case could >>> > be made to pull that extension into Substrait itself. If they're only >>> > attached wherever the user manually overrides a name, I think the size >>> > explosion Phillip is talking about would be avoided. I *really* don't >>> > see why anyone would want to match *generated* names in any functional >>> > way; that's a recipe for undefined behavior. >>> > >>> > None of Substrait's built-in things make use of column names though, >>> > which implies that any Arrow execution node that *does* need it cannot >>> > be described with Substrait, and thus should never appear in an >>> > execution plan deserialized from Substrait. If it currently does, >>> > that's an issue with Arrow's Substrait consumer. If we'd want to >>> > correctly represent those kinds of nodes anyway, we'd already be >>> > dealing with an extension relation, to which we could attach whatever >>> > information we want. >>> > >>> > On Wed, 20 Apr 2022 at 16:05, Phillip Cloud <cpcl...@gmail.com> wrote: >>> > >>> > > A third alternative might be to adjust Arrow to move to field-index >>> based >>> > > execution instead of storing schemas in its relational operators. >>> > > >>> > > I suspect that deserializing the schema on the arrow side is the >>> right >>> > > solution in at least the near term. Repeating field names where they >>> > aren't >>> > > strictly necessary has a tendency to enlarge plans, and the effect >>> gets >>> > > worse as plans go through transformations. >>> > > >>> > > On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <rt...@hotmail.com> >>> wrote: >>> > > >>> > > > Hi Weston, >>> > > > >>> > > > To reiterate, the scenario is not a user writing a Substrait plan >>> by >>> > > hand, >>> > > > but a user writing an Ibis expression and the remaining steps - >>> > Substrait >>> > > > compilation, serialization to protobuf, deserialization to an Arrow >>> > plan, >>> > > > and execution of the plan - are done systematically. This issue is >>> > > related >>> > > > to Ibis, Ibis-Substrait, and Arrow together, so multiple persons >>> may >>> > need >>> > > > to bear on this issue. >>> > > > Index based field references should still work here. For example, >>> if >>> > > > I have tables: >>> > > > >>> > > > Key | LeftPayload -- Key | RightPayload >>> > > > >>> > > > The join expression would be field(0) == field(2) >>> > > > Right, but the point in this use case is that it would not be >>> > convenient >>> > > > for a user writing an Ibis expression to specify each key of each >>> > table, >>> > > > especially when plenty of (say, 100) tables are being joined. It >>> would >>> > be >>> > > > much more convenient for the user to specify once the name of the >>> key >>> > > that >>> > > > exists in all tables. This is a use case where the key should (not >>> > must) >>> > > be >>> > > > specified by name for convenience. >>> > > > In both of those cases the field names are not part of the plan >>> itself. >>> > > > In the second use case, the user writing the ibis expression is >>> > > specifying >>> > > > a string-name as a parameter to a relation that would later get >>> passed >>> > to >>> > > > an Arrow execution node in an options instance and used to >>> dynamically >>> > > set >>> > > > up field-names. This string-name does in fact appear explicitly in >>> the >>> > > > Substrait plan, not for convenience but of necessity. What does not >>> > > appear >>> > > > in the Substrait plan are output field-names of intermediate >>> relations; >>> > > > this in itself is not a problem, because these field-names can (in >>> > > > principle) be recomputed to allow them to be matched to the >>> dynamically >>> > > > set-up field-names. But the current implementation of Arrow (in its >>> > > > Substrait module) does not do this because the schemata are not >>> > available >>> > > > during deserialization, when the option instances (such as >>> > > > ProjectNodeOptions) that require the schemata are created. Instead, >>> > Arrow >>> > > > (in its Substrait module) ends up with non-natural field-names like >>> > > > "FieldPath(1)" that fail to be matched. >>> > > > >>> > > > That's why this is not a Substrait specific problem, but one that >>> is >>> > > > related to Ibis, Ibis-Substrait, and Arrow together. We think it >>> can be >>> > > > resolved either by changing Substrait to assist Arrow by >>> specifying the >>> > > > field-names of intermediate relations in the plan, so that they are >>> > > readily >>> > > > available during deserialization, or by changing Arrow (in its >>> > Substrait >>> > > > module) to reproduce the field-names during deserialization, like >>> by >>> > > > computing the schemata before options and node instances are >>> created. >>> > We >>> > > > might come up with other solutions in this discussion, of course. >>> > Either >>> > > > way, for this second use case, the field-names of intermediate >>> > relations >>> > > > must be natural; they cannot be left as something like >>> "FieldPath(1)". >>> > > > >>> > > > >>> > > > Yaron. >>> > > > ________________________________ >>> > > > From: Weston Pace <weston.p...@gmail.com> >>> > > > Sent: Tuesday, April 19, 2022 7:12 PM >>> > > > To: dev@arrow.apache.org <dev@arrow.apache.org> >>> > > > Cc: Li Jin <ice.xell...@gmail.com> >>> > > > Subject: Re: [C++] output field names in Arrow Substrait >>> > > > >>> > > > > However, the problem is there are natural cases in >>> > > > > which an execution node should or must take in a string-name >>> > > > >>> > > > If we can come up with such a case then I agree it would be a >>> problem >>> > > > for Substrait's current definition. I don't think we can come up >>> with >>> > > > such a case. Every column that can be referenced by name has a >>> unique >>> > > > index we could use instead. >>> > > > >>> > > > > One use case is an execution node that is joining N >>> > > > > input tables on a column-name that exists in all of them. >>> > > > >>> > > > Index based field references should still work here. For example, >>> if >>> > > > I have tables: >>> > > > >>> > > > Key | LeftPayload -- Key | RightPayload >>> > > > >>> > > > The join expression would be field(0) == field(2) >>> > > > >>> > > > > The execution node would make itself more convenient >>> > > > > for the user by allowing specifying a string-name than by >>> > > > > specifying N FieldRef instances (each with the same >>> > > > > name but on a different input table) like the above >>> > > > > requirement would force >>> > > > >>> > > > Substrait plans aren't normally created by humans. I'm not sure >>> > > > convenience is a factor here. >>> > > > >>> > > > > Another use case for a string-name is when it is used within >>> > > > > the execution node to dynamically set up field-names, e.g., a >>> > > > > node that operates on the input's columns whose name starts >>> > > > > with the given string-name or a node that operates on an input >>> > > > > column whose name is given as data in another input column. >>> > > > >>> > > > In both of those cases the field names are not part of the plan >>> itself. >>> > > > >>> > > > On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <rt...@hotmail.com> >>> wrote: >>> > > > > >>> > > > > Hi Weston, >>> > > > > >>> > > > > Thanks for the quick response. >>> > > > > I think you might have forgotten the links for [1][2][3] >>> > > > > Sorry about the confusion; I use these not as references to >>> links but >>> > > as >>> > > > markers of points I make in the beginning that I elaborate on >>> later, in >>> > > the >>> > > > places where I reuse the markers. >>> > > > > Are you going from Substrait to an Arrow execution plan? >>> > > > > Yes. >>> > > > > For Substrait -> Arrow most of our execution nodes should take >>> in a >>> > > > FieldRef which can be a name but can also be index-based. So I >>> wouldn't >>> > > > expect Substrait's exclusive use of index-based references to be an >>> > > issue. >>> > > > > Yes, I agree this is a design feature of Substrait and that if >>> the >>> > > > execution node takes in a FieldRef then there won't be a problem >>> with >>> > > > Substrait use of an index-based reference for it. If any >>> > to-be-developed >>> > > > execution node must indeed use a FieldRef when taking in a >>> parameter >>> > that >>> > > > refers to columns, then this requirement should be documented for >>> > > > developers of execution nodes. >>> > > > > >>> > > > > However, the problem is there are natural cases in which an >>> execution >>> > > > node should or must take in a string-name; these suffer from the >>> above >>> > > > design feature due to the non-natural field names that it leads >>> to. One >>> > > use >>> > > > case is an execution node that is joining N input tables on a >>> > column-name >>> > > > that exists in all of them. The execution node would make itself >>> more >>> > > > convenient for the user by allowing specifying a string-name than >>> by >>> > > > specifying N FieldRef instances (each with the same name but on a >>> > > different >>> > > > input table) like the above requirement would force. Another use >>> case >>> > > for a >>> > > > string-name is when it is used within the execution node to >>> dynamically >>> > > set >>> > > > up field-names, e.g., a node that operates on the input's columns >>> whose >>> > > > name starts with the given string-name or a node that operates on >>> an >>> > > input >>> > > > column whose name is given as data in another input column. >>> > > > > >>> > > > > This is the main reason we think there should be appropriate >>> > (natural) >>> > > > field names defined for each relation/node. >>> > > > > Also keep in mind that Substrait isn't exactly a query language >>> for >>> > > > user's to be typing by hand. >>> > > > > Yes, I took that into consideration. Namely, the above discussion >>> > > refers >>> > > > to the scenario of a user writing an Ibis expression and the >>> remaining >>> > > > steps - Substrait compilation, serialization to protobuf,, >>> > > deserialization >>> > > > to an Arrow plan, and execution of the plan - are done >>> systematically. >>> > > > > >>> > > > > >>> > > > > Yaron. >>> > > > > ________________________________ >>> > > > > From: Weston Pace <weston.p...@gmail.com> >>> > > > > Sent: Tuesday, April 19, 2022 1:01 PM >>> > > > > To: dev@arrow.apache.org <dev@arrow.apache.org> >>> > > > > Cc: Li Jin <ice.xell...@gmail.com> >>> > > > > Subject: Re: [C++] output field names in Arrow Substrait >>> > > > > >>> > > > > Hi Yaron, I think you might have forgotten the links for >>> [1][2][3] so >>> > > > > I'm not entirely sure of the context. Are you going from >>> Substrait >>> > to >>> > > > > an Arrow execution plan? Or are you going from an Arrow >>> execution >>> > > > > plan to Substrait? >>> > > > > >>> > > > > For Substrait -> Arrow most of our execution nodes should take >>> in a >>> > > > > FieldRef which can be a name but can also be index-based. So I >>> > > > > wouldn't expect Substrait's exclusive use of index-based >>> references >>> > to >>> > > > > be an issue. Also keep in mind that Substrait isn't exactly a >>> query >>> > > > > language for user's to be typing by hand. So, for example, if a >>> user >>> > > > > wants a join and is using Ibis they could type: >>> > > > > >>> > > > > table.semi_join(s, table.x == table.y) >>> > > > > >>> > > > > Ibis is then responsible for converting "x" and "y" to the >>> > appropriate >>> > > > > indices (and it should have all the information needed to do so). >>> > The >>> > > > > Substrait plan will refer to these nodes by index and the >>> > > > > corresponding Arrow execution plan will use integer-based >>> FieldRef. >>> > > > > >>> > > > > For Arrow -> Substrait then I agree that this could become a >>> problem. >>> > > > > Right now the Arrow -> Substrait path is mainly used for internal >>> > > > > testing with the hope that Substrait -> Arrow -> Substrait should >>> > > > > generally be possible (with zero loss of information). This >>> does not >>> > > > > mean that every Arrow plan will be convertible into Substrait. >>> That >>> > > > > is certainly a potential goal, and PRs to add that capability >>> would >>> > be >>> > > > > welcome, but I don't know if anyone working on the >>> Arrow/Substrait >>> > > > > integration has that goal in mind. If that is your goal I might >>> be >>> > > > > curious to learn more about your use cases. >>> > > > > >>> > > > > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <rt...@hotmail.com> >>> > wrote: >>> > > > > > >>> > > > > > Hi, >>> > > > > > >>> > > > > > >>> > > > > > We ran into an issue due to the fact that, for intermediate >>> > > relations, >>> > > > Substrait does not automatically compute output field names nor >>> allows >>> > > one >>> > > > to explicitly name output fields [1]. This leads to trouble when >>> one >>> > > needs >>> > > > to refer to these output fields by name [2]. We run into this >>> trouble >>> > > when >>> > > > deserializing in Arrow [3] (based on commit 8e13c2dd). >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > One use case where [1] occurs is: >>> > > > > > >>> > > > > > import ibis >>> > > > > > >>> > > > > > import pyarrow as pa >>> > > > > > >>> > > > > > loaded_table = ibis.local_table(...) >>> > > > > > >>> > > > > > cast_table = ibis.mutate(loaded_table, >>> > > > time=loaded_table.time.cast(pa.int64()) >>> > > > > > >>> > > > > > Even if loaded_table has nicely named fields, like "time" and >>> > > "value1" >>> > > > and "value2", the resulting cast_table (when deserialized in >>> Arrow) has >>> > > > inconvenient field names like "FieldPath(1)" and "FieldPath(2)" >>> for all >>> > > but >>> > > > the "time" field. I believe that Substrait has all the information >>> to >>> > > > automatically compute the field name "value*" for cast_table, but >>> it >>> > > > doesn't do this. Moreover, the caller has to know the schema of >>> > > > loaded_table in order to set the output field names for >>> cast_table, and >>> > > > this is not convenient. When cast_table is an intermediate relation >>> > > (i.e., >>> > > > not the root relation of the plan), Substrait also doesn't allow >>> the >>> > > caller >>> > > > to explicitly name the output fields, and there is no place for >>> these >>> > > field >>> > > > names in the Substrait protobuf. >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > One use case where [2] occurs is in a join-type operation over >>> > > > cast_table. The caller would normally like to use a field name >>> (and not >>> > > an >>> > > > index) to refer to the join-key. Even if the caller knows the field >>> > name >>> > > > for the join-key in loaded_table, many field names in cast_table >>> (when >>> > > > deserialized in Arrow) are different (and each includes an index >>> in it) >>> > > > than those of loaded_table. >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > The case of [3] occurs because Arrow deserializes >>> ExecNodeOption >>> > > > instances before it has Schema instances at hand. At this stage, >>> > without >>> > > > schemata, Arrow cannot compute field names that should be placed >>> in a >>> > > > ExecNodeOption that needs them, in particular ProjectNodeOptions. >>> > > > Currently, Arrow creates an expression vector for the >>> > ProjectNodeOptions >>> > > > instance but leaves the names vector empty, and later defaults each >>> > name >>> > > to >>> > > > the ToString of each expression. >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > We'd like your input on this issue. Currently, we see two >>> viable >>> > ways >>> > > > to resolve this issue: (1) add output field name for intermediate >>> > > relations >>> > > > in the Substrait plan so that Arrow can directly access them, or >>> (2) >>> > > > reproduce the field names in Arrow during deserialization by >>> creating >>> > > > Schema instances before ExecNodeOptions instances. In any case, we >>> > think >>> > > > there should be appropriate field names defined for each >>> relation/node. >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > > Note that due to this issue, any Arrow execution node that >>> prompts >>> > > the >>> > > > user to refer to fields by name might not play nicely with >>> Substrait. >>> > > > > > >>> > > > > > >>> > > > > > Cheers, >>> > > > > > Yaron. >>> > > > >>> > > >>> > >>> >>