Re: [C++] output field names in Arrow Substrait

Phillip Cloud Wed, 20 Apr 2022 07:04:54 -0700

A third alternative might be to adjust Arrow to move to field-index based
execution instead of storing schemas in its relational operators.


I suspect that deserializing the schema on the arrow side is the right
solution in at least the near term. Repeating field names where they aren't
strictly necessary has a tendency to enlarge plans, and the effect gets
worse as plans go through transformations.

On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <[email protected]> wrote:

> Hi Weston,
>
> To reiterate, the scenario is not a user writing a Substrait plan by hand,
> but a user writing an Ibis expression and the remaining steps - Substrait
> compilation, serialization to protobuf, deserialization to an Arrow plan,
> and execution of the plan - are done systematically. This issue is related
> to Ibis, Ibis-Substrait, and Arrow together, so multiple persons may need
> to bear on this issue.
> Index based field references should still work here.  For example, if
> I have tables:
>
> Key | LeftPayload    --    Key | RightPayload
>
> The join expression would be field(0) == field(2)
> Right, but the point in this use case is that it would not be convenient
> for a user writing an Ibis expression to specify each key of each table,
> especially when plenty of (say, 100) tables are being joined. It would be
> much more convenient for the user to specify once the name of the key that
> exists in all tables. This is a use case where the key should (not must) be
> specified by name for convenience.
> In both of those cases the field names are not part of the plan itself.
> In the second use case, the user writing the ibis expression is specifying
> a string-name as a parameter to a relation that would later get passed to
> an Arrow execution node in an options instance and used to dynamically set
> up field-names. This string-name does in fact appear explicitly in the
> Substrait plan, not for convenience but of necessity. What does not appear
> in the Substrait plan are output field-names of intermediate relations;
> this in itself is not a problem, because these field-names can (in
> principle) be recomputed to allow them to be matched to the dynamically
> set-up field-names. But the current implementation of Arrow (in its
> Substrait module) does not do this because the schemata are not available
> during deserialization, when the option instances (such as
> ProjectNodeOptions) that require the schemata are created. Instead, Arrow
> (in its Substrait module) ends up with non-natural field-names like
> "FieldPath(1)" that fail to be matched.
>
> That's why this is not a Substrait specific problem, but one that is
> related to Ibis, Ibis-Substrait, and Arrow together. We think it can be
> resolved either by changing Substrait to assist Arrow by specifying the
> field-names of intermediate relations in the plan, so that they are readily
> available during deserialization, or by changing Arrow (in its Substrait
> module) to reproduce the field-names during deserialization, like by
> computing the schemata before options and node instances are created. We
> might come up with other solutions in this discussion, of course. Either
> way, for this second use case, the field-names of intermediate relations
> must be natural; they cannot be left as something like "FieldPath(1)".
>
>
> Yaron.
> ________________________________
> From: Weston Pace <[email protected]>
> Sent: Tuesday, April 19, 2022 7:12 PM
> To: [email protected] <[email protected]>
> Cc: Li Jin <[email protected]>
> Subject: Re: [C++] output field names in Arrow Substrait
>
> > However, the problem is there are natural cases in
> > which an execution node should or must take in a string-name
>
> If we can come up with such a case then I agree it would be a problem
> for Substrait's current definition.  I don't think we can come up with
> such a case.  Every column that can be referenced by name has a unique
> index we could use instead.
>
> > One use case is an execution node that is joining N
> > input tables on a column-name that exists in all of them.
>
> Index based field references should still work here.  For example, if
> I have tables:
>
> Key | LeftPayload    --    Key | RightPayload
>
> The join expression would be field(0) == field(2)
>
> > The execution node would make itself more convenient
> > for the user by allowing specifying a string-name than by
> > specifying N FieldRef instances (each with the same
> > name but on a different input table) like the above
> > requirement would force
>
> Substrait plans aren't normally created by humans.  I'm not sure
> convenience is a factor here.
>
> > Another use case for a string-name is when it is used within
> > the execution node to dynamically set up field-names, e.g., a
> > node that operates on the input's columns whose name starts
> > with the given string-name or a node that operates on an input
> > column whose name is given as data in another input column.
>
> In both of those cases the field names are not part of the plan itself.
>
> On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <[email protected]> wrote:
> >
> > Hi Weston,
> >
> > Thanks for the quick response.
> > I think you might have forgotten the links for [1][2][3]
> > Sorry about the confusion; I use these not as references to links but as
> markers of points I make in the beginning that I elaborate on later, in the
> places where I reuse the markers.
> > Are you going from Substrait to an Arrow execution plan?
> > Yes.
> > For Substrait -> Arrow most of our execution nodes should take in a
> FieldRef which can be a name but can also be index-based. So I wouldn't
> expect Substrait's exclusive use of index-based references to be an issue.
> > Yes, I agree this is a design feature of Substrait and that if the
> execution node takes in a FieldRef then there won't be a problem with
> Substrait use of an index-based reference for it. If any to-be-developed
> execution node must indeed use a FieldRef when taking in a parameter that
> refers to columns, then this requirement should be documented for
> developers of execution nodes.
> >
> > However, the problem is there are natural cases in which an execution
> node should or must take in a string-name; these suffer from the above
> design feature due to the non-natural field names that it leads to. One use
> case is an execution node that is joining N input tables on a column-name
> that exists in all of them. The execution node would make itself more
> convenient for the user by allowing specifying a string-name than by
> specifying N FieldRef instances (each with the same name but on a different
> input table) like the above requirement would force. Another use case for a
> string-name is when it is used within the execution node to dynamically set
> up field-names, e.g., a node that operates on the input's columns whose
> name starts with the given string-name or a node that operates on an input
> column whose name is given as data in another input column.
> >
> > This is the main reason we think there should be appropriate (natural)
> field names defined for each relation/node.
> > Also keep in mind that Substrait isn't exactly a query language for
> user's to be typing by hand.
> > Yes, I took that into consideration. Namely, the above discussion refers
> to the scenario of a user writing an Ibis expression and the remaining
> steps - Substrait compilation, serialization to protobuf,, deserialization
> to an Arrow plan, and execution of the plan - are done systematically.
> >
> >
> > Yaron.
> > ________________________________
> > From: Weston Pace <[email protected]>
> > Sent: Tuesday, April 19, 2022 1:01 PM
> > To: [email protected] <[email protected]>
> > Cc: Li Jin <[email protected]>
> > Subject: Re: [C++] output field names in Arrow Substrait
> >
> > Hi Yaron, I think you might have forgotten the links for [1][2][3] so
> > I'm not entirely sure of the context.  Are you going from Substrait to
> > an Arrow execution plan?  Or are you going from an Arrow execution
> > plan to Substrait?
> >
> > For Substrait -> Arrow most of our execution nodes should take in a
> > FieldRef which can be a name but can also be index-based.  So I
> > wouldn't expect Substrait's exclusive use of index-based references to
> > be an issue.  Also keep in mind that Substrait isn't exactly a query
> > language for user's to be typing by hand.  So, for example, if a user
> > wants a join and is using Ibis they could type:
> >
> > table.semi_join(s, table.x == table.y)
> >
> > Ibis is then responsible for converting "x" and "y" to the appropriate
> > indices (and it should have all the information needed to do so).  The
> > Substrait plan will refer to these nodes by index and the
> > corresponding Arrow execution plan will use integer-based FieldRef.
> >
> > For Arrow -> Substrait then I agree that this could become a problem.
> > Right now the Arrow -> Substrait path is mainly used for internal
> > testing with the hope that Substrait -> Arrow -> Substrait should
> > generally be possible (with zero loss of information).  This does not
> > mean that every Arrow plan will be convertible into Substrait.  That
> > is certainly a potential goal, and PRs to add that capability would be
> > welcome, but I don't know if anyone working on the Arrow/Substrait
> > integration has that goal in mind.  If that is your goal I might be
> > curious to learn more about your use cases.
> >
> > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > >
> > > We ran into an issue due to the fact that, for intermediate relations,
> Substrait does not automatically compute output field names nor allows one
> to explicitly name output fields [1]. This leads to trouble when one needs
> to refer to these output fields by name [2]. We run into this trouble when
> deserializing in Arrow [3] (based on commit 8e13c2dd).
> > >
> > >
> > >
> > > One use case where [1] occurs is:
> > >
> > > import ibis
> > >
> > > import pyarrow as pa
> > >
> > > loaded_table = ibis.local_table(...)
> > >
> > > cast_table = ibis.mutate(loaded_table,
> time=loaded_table.time.cast(pa.int64())
> > >
> > > Even if loaded_table has nicely named fields, like "time" and "value1"
> and "value2", the resulting cast_table (when deserialized in Arrow) has
> inconvenient field names like "FieldPath(1)" and "FieldPath(2)" for all but
> the "time" field. I believe that Substrait has all the information to
> automatically compute the field name "value*" for cast_table, but it
> doesn't do this. Moreover, the caller has to know the schema of
> loaded_table in order to set the output field names for cast_table, and
> this is not convenient. When cast_table is an intermediate relation (i.e.,
> not the root relation of the plan), Substrait also doesn't allow the caller
> to explicitly name the output fields, and there is no place for these field
> names in the Substrait protobuf.
> > >
> > >
> > >
> > > One use case where [2] occurs is in a join-type operation over
> cast_table. The caller would normally like to use a field name (and not an
> index) to refer to the join-key. Even if the caller knows the field name
> for the join-key in loaded_table, many field names in cast_table (when
> deserialized in Arrow) are different (and each includes an index in it)
> than those of loaded_table.
> > >
> > >
> > >
> > > The case of [3] occurs because Arrow deserializes ExecNodeOption
> instances before it has Schema instances at hand. At this stage, without
> schemata, Arrow cannot compute field names that should be placed in a
> ExecNodeOption that needs them, in particular ProjectNodeOptions.
> Currently, Arrow creates an expression vector for the ProjectNodeOptions
> instance but leaves the names vector empty, and later defaults each name to
> the ToString of each expression.
> > >
> > >
> > >
> > > We'd like your input on this issue. Currently, we see two viable ways
> to resolve this issue: (1) add output field name for intermediate relations
> in the Substrait plan so that Arrow can directly access them, or (2)
> reproduce the field names in Arrow during deserialization by creating
> Schema instances before ExecNodeOptions instances. In any case, we think
> there should be appropriate field names defined for each relation/node.
> > >
> > >
> > >
> > > Note that due to this issue, any Arrow execution node that prompts the
> user to refer to fields by name might not play nicely with Substrait.
> > >
> > >
> > > Cheers,
> > > Yaron.
>

Re: [C++] output field names in Arrow Substrait

Reply via email to