A third alternative might be to adjust Arrow to move to field-index based execution instead of storing schemas in its relational operators.
I suspect that deserializing the schema on the arrow side is the right solution in at least the near term. Repeating field names where they aren't strictly necessary has a tendency to enlarge plans, and the effect gets worse as plans go through transformations. On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <rt...@hotmail.com> wrote: > Hi Weston, > > To reiterate, the scenario is not a user writing a Substrait plan by hand, > but a user writing an Ibis expression and the remaining steps - Substrait > compilation, serialization to protobuf, deserialization to an Arrow plan, > and execution of the plan - are done systematically. This issue is related > to Ibis, Ibis-Substrait, and Arrow together, so multiple persons may need > to bear on this issue. > Index based field references should still work here. For example, if > I have tables: > > Key | LeftPayload -- Key | RightPayload > > The join expression would be field(0) == field(2) > Right, but the point in this use case is that it would not be convenient > for a user writing an Ibis expression to specify each key of each table, > especially when plenty of (say, 100) tables are being joined. It would be > much more convenient for the user to specify once the name of the key that > exists in all tables. This is a use case where the key should (not must) be > specified by name for convenience. > In both of those cases the field names are not part of the plan itself. > In the second use case, the user writing the ibis expression is specifying > a string-name as a parameter to a relation that would later get passed to > an Arrow execution node in an options instance and used to dynamically set > up field-names. This string-name does in fact appear explicitly in the > Substrait plan, not for convenience but of necessity. What does not appear > in the Substrait plan are output field-names of intermediate relations; > this in itself is not a problem, because these field-names can (in > principle) be recomputed to allow them to be matched to the dynamically > set-up field-names. But the current implementation of Arrow (in its > Substrait module) does not do this because the schemata are not available > during deserialization, when the option instances (such as > ProjectNodeOptions) that require the schemata are created. Instead, Arrow > (in its Substrait module) ends up with non-natural field-names like > "FieldPath(1)" that fail to be matched. > > That's why this is not a Substrait specific problem, but one that is > related to Ibis, Ibis-Substrait, and Arrow together. We think it can be > resolved either by changing Substrait to assist Arrow by specifying the > field-names of intermediate relations in the plan, so that they are readily > available during deserialization, or by changing Arrow (in its Substrait > module) to reproduce the field-names during deserialization, like by > computing the schemata before options and node instances are created. We > might come up with other solutions in this discussion, of course. Either > way, for this second use case, the field-names of intermediate relations > must be natural; they cannot be left as something like "FieldPath(1)". > > > Yaron. > ________________________________ > From: Weston Pace <weston.p...@gmail.com> > Sent: Tuesday, April 19, 2022 7:12 PM > To: dev@arrow.apache.org <dev@arrow.apache.org> > Cc: Li Jin <ice.xell...@gmail.com> > Subject: Re: [C++] output field names in Arrow Substrait > > > However, the problem is there are natural cases in > > which an execution node should or must take in a string-name > > If we can come up with such a case then I agree it would be a problem > for Substrait's current definition. I don't think we can come up with > such a case. Every column that can be referenced by name has a unique > index we could use instead. > > > One use case is an execution node that is joining N > > input tables on a column-name that exists in all of them. > > Index based field references should still work here. For example, if > I have tables: > > Key | LeftPayload -- Key | RightPayload > > The join expression would be field(0) == field(2) > > > The execution node would make itself more convenient > > for the user by allowing specifying a string-name than by > > specifying N FieldRef instances (each with the same > > name but on a different input table) like the above > > requirement would force > > Substrait plans aren't normally created by humans. I'm not sure > convenience is a factor here. > > > Another use case for a string-name is when it is used within > > the execution node to dynamically set up field-names, e.g., a > > node that operates on the input's columns whose name starts > > with the given string-name or a node that operates on an input > > column whose name is given as data in another input column. > > In both of those cases the field names are not part of the plan itself. > > On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <rt...@hotmail.com> wrote: > > > > Hi Weston, > > > > Thanks for the quick response. > > I think you might have forgotten the links for [1][2][3] > > Sorry about the confusion; I use these not as references to links but as > markers of points I make in the beginning that I elaborate on later, in the > places where I reuse the markers. > > Are you going from Substrait to an Arrow execution plan? > > Yes. > > For Substrait -> Arrow most of our execution nodes should take in a > FieldRef which can be a name but can also be index-based. So I wouldn't > expect Substrait's exclusive use of index-based references to be an issue. > > Yes, I agree this is a design feature of Substrait and that if the > execution node takes in a FieldRef then there won't be a problem with > Substrait use of an index-based reference for it. If any to-be-developed > execution node must indeed use a FieldRef when taking in a parameter that > refers to columns, then this requirement should be documented for > developers of execution nodes. > > > > However, the problem is there are natural cases in which an execution > node should or must take in a string-name; these suffer from the above > design feature due to the non-natural field names that it leads to. One use > case is an execution node that is joining N input tables on a column-name > that exists in all of them. The execution node would make itself more > convenient for the user by allowing specifying a string-name than by > specifying N FieldRef instances (each with the same name but on a different > input table) like the above requirement would force. Another use case for a > string-name is when it is used within the execution node to dynamically set > up field-names, e.g., a node that operates on the input's columns whose > name starts with the given string-name or a node that operates on an input > column whose name is given as data in another input column. > > > > This is the main reason we think there should be appropriate (natural) > field names defined for each relation/node. > > Also keep in mind that Substrait isn't exactly a query language for > user's to be typing by hand. > > Yes, I took that into consideration. Namely, the above discussion refers > to the scenario of a user writing an Ibis expression and the remaining > steps - Substrait compilation, serialization to protobuf,, deserialization > to an Arrow plan, and execution of the plan - are done systematically. > > > > > > Yaron. > > ________________________________ > > From: Weston Pace <weston.p...@gmail.com> > > Sent: Tuesday, April 19, 2022 1:01 PM > > To: dev@arrow.apache.org <dev@arrow.apache.org> > > Cc: Li Jin <ice.xell...@gmail.com> > > Subject: Re: [C++] output field names in Arrow Substrait > > > > Hi Yaron, I think you might have forgotten the links for [1][2][3] so > > I'm not entirely sure of the context. Are you going from Substrait to > > an Arrow execution plan? Or are you going from an Arrow execution > > plan to Substrait? > > > > For Substrait -> Arrow most of our execution nodes should take in a > > FieldRef which can be a name but can also be index-based. So I > > wouldn't expect Substrait's exclusive use of index-based references to > > be an issue. Also keep in mind that Substrait isn't exactly a query > > language for user's to be typing by hand. So, for example, if a user > > wants a join and is using Ibis they could type: > > > > table.semi_join(s, table.x == table.y) > > > > Ibis is then responsible for converting "x" and "y" to the appropriate > > indices (and it should have all the information needed to do so). The > > Substrait plan will refer to these nodes by index and the > > corresponding Arrow execution plan will use integer-based FieldRef. > > > > For Arrow -> Substrait then I agree that this could become a problem. > > Right now the Arrow -> Substrait path is mainly used for internal > > testing with the hope that Substrait -> Arrow -> Substrait should > > generally be possible (with zero loss of information). This does not > > mean that every Arrow plan will be convertible into Substrait. That > > is certainly a potential goal, and PRs to add that capability would be > > welcome, but I don't know if anyone working on the Arrow/Substrait > > integration has that goal in mind. If that is your goal I might be > > curious to learn more about your use cases. > > > > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <rt...@hotmail.com> wrote: > > > > > > Hi, > > > > > > > > > We ran into an issue due to the fact that, for intermediate relations, > Substrait does not automatically compute output field names nor allows one > to explicitly name output fields [1]. This leads to trouble when one needs > to refer to these output fields by name [2]. We run into this trouble when > deserializing in Arrow [3] (based on commit 8e13c2dd). > > > > > > > > > > > > One use case where [1] occurs is: > > > > > > import ibis > > > > > > import pyarrow as pa > > > > > > loaded_table = ibis.local_table(...) > > > > > > cast_table = ibis.mutate(loaded_table, > time=loaded_table.time.cast(pa.int64()) > > > > > > Even if loaded_table has nicely named fields, like "time" and "value1" > and "value2", the resulting cast_table (when deserialized in Arrow) has > inconvenient field names like "FieldPath(1)" and "FieldPath(2)" for all but > the "time" field. I believe that Substrait has all the information to > automatically compute the field name "value*" for cast_table, but it > doesn't do this. Moreover, the caller has to know the schema of > loaded_table in order to set the output field names for cast_table, and > this is not convenient. When cast_table is an intermediate relation (i.e., > not the root relation of the plan), Substrait also doesn't allow the caller > to explicitly name the output fields, and there is no place for these field > names in the Substrait protobuf. > > > > > > > > > > > > One use case where [2] occurs is in a join-type operation over > cast_table. The caller would normally like to use a field name (and not an > index) to refer to the join-key. Even if the caller knows the field name > for the join-key in loaded_table, many field names in cast_table (when > deserialized in Arrow) are different (and each includes an index in it) > than those of loaded_table. > > > > > > > > > > > > The case of [3] occurs because Arrow deserializes ExecNodeOption > instances before it has Schema instances at hand. At this stage, without > schemata, Arrow cannot compute field names that should be placed in a > ExecNodeOption that needs them, in particular ProjectNodeOptions. > Currently, Arrow creates an expression vector for the ProjectNodeOptions > instance but leaves the names vector empty, and later defaults each name to > the ToString of each expression. > > > > > > > > > > > > We'd like your input on this issue. Currently, we see two viable ways > to resolve this issue: (1) add output field name for intermediate relations > in the Substrait plan so that Arrow can directly access them, or (2) > reproduce the field names in Arrow during deserialization by creating > Schema instances before ExecNodeOptions instances. In any case, we think > there should be appropriate field names defined for each relation/node. > > > > > > > > > > > > Note that due to this issue, any Arrow execution node that prompts the > user to refer to fields by name might not play nicely with Substrait. > > > > > > > > > Cheers, > > > Yaron. >