Re: [C++] output field names in Arrow Substrait

Li Jin Wed, 20 Apr 2022 08:25:31 -0700

>  I don't think Substratit cannot convince Spark to go index based.
Sorry I meant " I don't think Substrait can convince Spark to go index
based"


On Wed, Apr 20, 2022 at 11:24 AM Li Jin <[email protected]> wrote:

> >  A third alternative might be to adjust Arrow to move to field-index
> based
> execution instead of storing schemas in its relational operators.
>
> This is an interesting idea - I do think there are operators that need
> column name information (e.g. a "checkpoint" node that writes to external
> storage). In this case, it sounds like any operator that requires knowing
> the column name needs to store additional information for the column names
> in addition to the field index, in its substrait representation.
>
> > The names at the relation root are a different story, since they specify
> the column names for the produced table, but there's already a mechanism
> for specifying those.
>
> I feel this is a bit limited because I don't think relation roots are the
> only thing that produces tables. I can think of two examples in Spark that
> produce tables and are not a relation root. (1) A checkpoint node that
> produces a table to write to external storage (2) A Python UDF node that
> materializes a slice of the table and produces a pd.DataFrame to pass down
> to a user function. I think we will see these cases in Arrow as well. I
> feel it might make sense to have a subclass of the operator that requires
> column names in addition to just root rel.
>
> *On schema-based vs index-based**:*
>
> From personal experience working on Spark it would say that having schema
> available in each operator is useful for developers - I often print
> input/output rows and I feel not having the column names available might
> put a tax on the developer. (But maybe there are other ways to solve this).
>
> Are there other substrait consumers at the moment and how do they deal
> with this? For example, if we were to have Spark consumes substrait I'd
> imagine there would be similar problems because schema is part of Spark
> operators too and I don't think Substratit cannot convince Spark to go
> index based.
> I'd imagine during deserialization of the subtrait plan to Spark it would
> need to compute a schema object for each operator. That's why I think
> computing schema during serialization might be a better solution than going
> for index based in Arrow.
>
>
> On Wed, Apr 20, 2022 at 11:14 AM Phillip Cloud <[email protected]> wrote:
>
>> On Wed, Apr 20, 2022, 11:31 Jeroen van Straten <
>> [email protected]>
>> wrote:
>>
>> > From a Substrait perspective, it would be up to Ibis to convert the
>> > column names to the correct indices. This can and should be transparent
>> > to the end user; front-ends already need to know what the schemas are
>> > in order to produce a valid Substrait plan, so Ibis should have all the
>> > information it needs to do this conversion.
>> >
>> > I would, however, agree that it's inconsistent for Substrait to require
>> > that column names are specified for read relations, but not for the
>> > other relations. AFAICT nothing on the consumer side can or should make
>> > use of those names. The only use case I can think of is to aid
>> > debugging; if that's the only reason they exist, it should indeed be
>> > possible to specify the names for each relation, but it should be in
>> > the form of as optional, non-functional hints in RelCommon. The names
>> > at the relation root are a different story, since they specify the
>> > column names for the produced table, but there's already a mechanism
>> > for specifying those. No idea if Arrow and Ibis use it (since it's
>> > optional), but Substrait supports it.
>> >
>>
>> ReadRels having column names is probably not strictly necessary for all
>> the
>> ReadRels, but for NamedTable it's necessary. For example, if I wanted to
>> represent a ReadRel independent of storage, that an API could deserialize
>> for metadata, and present that metadata to the user then you need column
>> names. Ibis uses the column names from ReadRels to turn Substrait back
>> into
>> expressions.
>>
>> The RelRoot is very much used in ibis. Every table expression compiles to
>> a
>> RelRoot.
>>
>>
>> > If there *are* special cases I'm not seeing, where the execution engine
>> > really does need the names of intermediate columns, I suppose it would
>> > be up to the engine to figure out what they would be. The information
>> > could also be added via Substrait "advanced extensions," especially
>> > when a user overrides a name, but it wouldn't be a good idea for the
>> > consumer to rely on the existance of these extensions, for
>> > compatibility with other frameworks.
>>
>>
>> > If said special cases turn out to not be so special, a valid case could
>> > be made to pull that extension into Substrait itself. If they're only
>> > attached wherever the user manually overrides a name, I think the size
>> > explosion Phillip is talking about would be avoided. I *really* don't
>> > see why anyone would want to match *generated* names in any functional
>> > way; that's a recipe for undefined behavior.
>> >
>> > None of Substrait's built-in things make use of column names though,
>> > which implies that any Arrow execution node that *does* need it cannot
>> > be described with Substrait, and thus should never appear in an
>> > execution plan deserialized from Substrait. If it currently does,
>> > that's an issue with Arrow's Substrait consumer. If we'd want to
>> > correctly represent those kinds of nodes anyway, we'd already be
>> > dealing with an extension relation, to which we could attach whatever
>> > information we want.
>> >
>> > On Wed, 20 Apr 2022 at 16:05, Phillip Cloud <[email protected]> wrote:
>> >
>> > > A third alternative might be to adjust Arrow to move to field-index
>> based
>> > > execution instead of storing schemas in its relational operators.
>> > >
>> > > I suspect that deserializing the schema on the arrow side is the right
>> > > solution in at least the near term. Repeating field names where they
>> > aren't
>> > > strictly necessary has a tendency to enlarge plans, and the effect
>> gets
>> > > worse as plans go through transformations.
>> > >
>> > > On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <[email protected]>
>> wrote:
>> > >
>> > > > Hi Weston,
>> > > >
>> > > > To reiterate, the scenario is not a user writing a Substrait plan by
>> > > hand,
>> > > > but a user writing an Ibis expression and the remaining steps -
>> > Substrait
>> > > > compilation, serialization to protobuf, deserialization to an Arrow
>> > plan,
>> > > > and execution of the plan - are done systematically. This issue is
>> > > related
>> > > > to Ibis, Ibis-Substrait, and Arrow together, so multiple persons may
>> > need
>> > > > to bear on this issue.
>> > > > Index based field references should still work here.  For example,
>> if
>> > > > I have tables:
>> > > >
>> > > > Key | LeftPayload    --    Key | RightPayload
>> > > >
>> > > > The join expression would be field(0) == field(2)
>> > > > Right, but the point in this use case is that it would not be
>> > convenient
>> > > > for a user writing an Ibis expression to specify each key of each
>> > table,
>> > > > especially when plenty of (say, 100) tables are being joined. It
>> would
>> > be
>> > > > much more convenient for the user to specify once the name of the
>> key
>> > > that
>> > > > exists in all tables. This is a use case where the key should (not
>> > must)
>> > > be
>> > > > specified by name for convenience.
>> > > > In both of those cases the field names are not part of the plan
>> itself.
>> > > > In the second use case, the user writing the ibis expression is
>> > > specifying
>> > > > a string-name as a parameter to a relation that would later get
>> passed
>> > to
>> > > > an Arrow execution node in an options instance and used to
>> dynamically
>> > > set
>> > > > up field-names. This string-name does in fact appear explicitly in
>> the
>> > > > Substrait plan, not for convenience but of necessity. What does not
>> > > appear
>> > > > in the Substrait plan are output field-names of intermediate
>> relations;
>> > > > this in itself is not a problem, because these field-names can (in
>> > > > principle) be recomputed to allow them to be matched to the
>> dynamically
>> > > > set-up field-names. But the current implementation of Arrow (in its
>> > > > Substrait module) does not do this because the schemata are not
>> > available
>> > > > during deserialization, when the option instances (such as
>> > > > ProjectNodeOptions) that require the schemata are created. Instead,
>> > Arrow
>> > > > (in its Substrait module) ends up with non-natural field-names like
>> > > > "FieldPath(1)" that fail to be matched.
>> > > >
>> > > > That's why this is not a Substrait specific problem, but one that is
>> > > > related to Ibis, Ibis-Substrait, and Arrow together. We think it
>> can be
>> > > > resolved either by changing Substrait to assist Arrow by specifying
>> the
>> > > > field-names of intermediate relations in the plan, so that they are
>> > > readily
>> > > > available during deserialization, or by changing Arrow (in its
>> > Substrait
>> > > > module) to reproduce the field-names during deserialization, like by
>> > > > computing the schemata before options and node instances are
>> created.
>> > We
>> > > > might come up with other solutions in this discussion, of course.
>> > Either
>> > > > way, for this second use case, the field-names of intermediate
>> > relations
>> > > > must be natural; they cannot be left as something like
>> "FieldPath(1)".
>> > > >
>> > > >
>> > > > Yaron.
>> > > > ________________________________
>> > > > From: Weston Pace <[email protected]>
>> > > > Sent: Tuesday, April 19, 2022 7:12 PM
>> > > > To: [email protected] <[email protected]>
>> > > > Cc: Li Jin <[email protected]>
>> > > > Subject: Re: [C++] output field names in Arrow Substrait
>> > > >
>> > > > > However, the problem is there are natural cases in
>> > > > > which an execution node should or must take in a string-name
>> > > >
>> > > > If we can come up with such a case then I agree it would be a
>> problem
>> > > > for Substrait's current definition.  I don't think we can come up
>> with
>> > > > such a case.  Every column that can be referenced by name has a
>> unique
>> > > > index we could use instead.
>> > > >
>> > > > > One use case is an execution node that is joining N
>> > > > > input tables on a column-name that exists in all of them.
>> > > >
>> > > > Index based field references should still work here.  For example,
>> if
>> > > > I have tables:
>> > > >
>> > > > Key | LeftPayload    --    Key | RightPayload
>> > > >
>> > > > The join expression would be field(0) == field(2)
>> > > >
>> > > > > The execution node would make itself more convenient
>> > > > > for the user by allowing specifying a string-name than by
>> > > > > specifying N FieldRef instances (each with the same
>> > > > > name but on a different input table) like the above
>> > > > > requirement would force
>> > > >
>> > > > Substrait plans aren't normally created by humans.  I'm not sure
>> > > > convenience is a factor here.
>> > > >
>> > > > > Another use case for a string-name is when it is used within
>> > > > > the execution node to dynamically set up field-names, e.g., a
>> > > > > node that operates on the input's columns whose name starts
>> > > > > with the given string-name or a node that operates on an input
>> > > > > column whose name is given as data in another input column.
>> > > >
>> > > > In both of those cases the field names are not part of the plan
>> itself.
>> > > >
>> > > > On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <[email protected]>
>> wrote:
>> > > > >
>> > > > > Hi Weston,
>> > > > >
>> > > > > Thanks for the quick response.
>> > > > > I think you might have forgotten the links for [1][2][3]
>> > > > > Sorry about the confusion; I use these not as references to links
>> but
>> > > as
>> > > > markers of points I make in the beginning that I elaborate on
>> later, in
>> > > the
>> > > > places where I reuse the markers.
>> > > > > Are you going from Substrait to an Arrow execution plan?
>> > > > > Yes.
>> > > > > For Substrait -> Arrow most of our execution nodes should take in
>> a
>> > > > FieldRef which can be a name but can also be index-based. So I
>> wouldn't
>> > > > expect Substrait's exclusive use of index-based references to be an
>> > > issue.
>> > > > > Yes, I agree this is a design feature of Substrait and that if the
>> > > > execution node takes in a FieldRef then there won't be a problem
>> with
>> > > > Substrait use of an index-based reference for it. If any
>> > to-be-developed
>> > > > execution node must indeed use a FieldRef when taking in a parameter
>> > that
>> > > > refers to columns, then this requirement should be documented for
>> > > > developers of execution nodes.
>> > > > >
>> > > > > However, the problem is there are natural cases in which an
>> execution
>> > > > node should or must take in a string-name; these suffer from the
>> above
>> > > > design feature due to the non-natural field names that it leads to.
>> One
>> > > use
>> > > > case is an execution node that is joining N input tables on a
>> > column-name
>> > > > that exists in all of them. The execution node would make itself
>> more
>> > > > convenient for the user by allowing specifying a string-name than by
>> > > > specifying N FieldRef instances (each with the same name but on a
>> > > different
>> > > > input table) like the above requirement would force. Another use
>> case
>> > > for a
>> > > > string-name is when it is used within the execution node to
>> dynamically
>> > > set
>> > > > up field-names, e.g., a node that operates on the input's columns
>> whose
>> > > > name starts with the given string-name or a node that operates on an
>> > > input
>> > > > column whose name is given as data in another input column.
>> > > > >
>> > > > > This is the main reason we think there should be appropriate
>> > (natural)
>> > > > field names defined for each relation/node.
>> > > > > Also keep in mind that Substrait isn't exactly a query language
>> for
>> > > > user's to be typing by hand.
>> > > > > Yes, I took that into consideration. Namely, the above discussion
>> > > refers
>> > > > to the scenario of a user writing an Ibis expression and the
>> remaining
>> > > > steps - Substrait compilation, serialization to protobuf,,
>> > > deserialization
>> > > > to an Arrow plan, and execution of the plan - are done
>> systematically.
>> > > > >
>> > > > >
>> > > > > Yaron.
>> > > > > ________________________________
>> > > > > From: Weston Pace <[email protected]>
>> > > > > Sent: Tuesday, April 19, 2022 1:01 PM
>> > > > > To: [email protected] <[email protected]>
>> > > > > Cc: Li Jin <[email protected]>
>> > > > > Subject: Re: [C++] output field names in Arrow Substrait
>> > > > >
>> > > > > Hi Yaron, I think you might have forgotten the links for
>> [1][2][3] so
>> > > > > I'm not entirely sure of the context.  Are you going from
>> Substrait
>> > to
>> > > > > an Arrow execution plan?  Or are you going from an Arrow execution
>> > > > > plan to Substrait?
>> > > > >
>> > > > > For Substrait -> Arrow most of our execution nodes should take in
>> a
>> > > > > FieldRef which can be a name but can also be index-based.  So I
>> > > > > wouldn't expect Substrait's exclusive use of index-based
>> references
>> > to
>> > > > > be an issue.  Also keep in mind that Substrait isn't exactly a
>> query
>> > > > > language for user's to be typing by hand.  So, for example, if a
>> user
>> > > > > wants a join and is using Ibis they could type:
>> > > > >
>> > > > > table.semi_join(s, table.x == table.y)
>> > > > >
>> > > > > Ibis is then responsible for converting "x" and "y" to the
>> > appropriate
>> > > > > indices (and it should have all the information needed to do so).
>> > The
>> > > > > Substrait plan will refer to these nodes by index and the
>> > > > > corresponding Arrow execution plan will use integer-based
>> FieldRef.
>> > > > >
>> > > > > For Arrow -> Substrait then I agree that this could become a
>> problem.
>> > > > > Right now the Arrow -> Substrait path is mainly used for internal
>> > > > > testing with the hope that Substrait -> Arrow -> Substrait should
>> > > > > generally be possible (with zero loss of information).  This does
>> not
>> > > > > mean that every Arrow plan will be convertible into Substrait.
>> That
>> > > > > is certainly a potential goal, and PRs to add that capability
>> would
>> > be
>> > > > > welcome, but I don't know if anyone working on the Arrow/Substrait
>> > > > > integration has that goal in mind.  If that is your goal I might
>> be
>> > > > > curious to learn more about your use cases.
>> > > > >
>> > > > > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <[email protected]>
>> > wrote:
>> > > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > >
>> > > > > > We ran into an issue due to the fact that, for intermediate
>> > > relations,
>> > > > Substrait does not automatically compute output field names nor
>> allows
>> > > one
>> > > > to explicitly name output fields [1]. This leads to trouble when one
>> > > needs
>> > > > to refer to these output fields by name [2]. We run into this
>> trouble
>> > > when
>> > > > deserializing in Arrow [3] (based on commit 8e13c2dd).
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > One use case where [1] occurs is:
>> > > > > >
>> > > > > > import ibis
>> > > > > >
>> > > > > > import pyarrow as pa
>> > > > > >
>> > > > > > loaded_table = ibis.local_table(...)
>> > > > > >
>> > > > > > cast_table = ibis.mutate(loaded_table,
>> > > > time=loaded_table.time.cast(pa.int64())
>> > > > > >
>> > > > > > Even if loaded_table has nicely named fields, like "time" and
>> > > "value1"
>> > > > and "value2", the resulting cast_table (when deserialized in Arrow)
>> has
>> > > > inconvenient field names like "FieldPath(1)" and "FieldPath(2)" for
>> all
>> > > but
>> > > > the "time" field. I believe that Substrait has all the information
>> to
>> > > > automatically compute the field name "value*" for cast_table, but it
>> > > > doesn't do this. Moreover, the caller has to know the schema of
>> > > > loaded_table in order to set the output field names for cast_table,
>> and
>> > > > this is not convenient. When cast_table is an intermediate relation
>> > > (i.e.,
>> > > > not the root relation of the plan), Substrait also doesn't allow the
>> > > caller
>> > > > to explicitly name the output fields, and there is no place for
>> these
>> > > field
>> > > > names in the Substrait protobuf.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > One use case where [2] occurs is in a join-type operation over
>> > > > cast_table. The caller would normally like to use a field name (and
>> not
>> > > an
>> > > > index) to refer to the join-key. Even if the caller knows the field
>> > name
>> > > > for the join-key in loaded_table, many field names in cast_table
>> (when
>> > > > deserialized in Arrow) are different (and each includes an index in
>> it)
>> > > > than those of loaded_table.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > The case of [3] occurs because Arrow deserializes ExecNodeOption
>> > > > instances before it has Schema instances at hand. At this stage,
>> > without
>> > > > schemata, Arrow cannot compute field names that should be placed in
>> a
>> > > > ExecNodeOption that needs them, in particular ProjectNodeOptions.
>> > > > Currently, Arrow creates an expression vector for the
>> > ProjectNodeOptions
>> > > > instance but leaves the names vector empty, and later defaults each
>> > name
>> > > to
>> > > > the ToString of each expression.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > We'd like your input on this issue. Currently, we see two viable
>> > ways
>> > > > to resolve this issue: (1) add output field name for intermediate
>> > > relations
>> > > > in the Substrait plan so that Arrow can directly access them, or (2)
>> > > > reproduce the field names in Arrow during deserialization by
>> creating
>> > > > Schema instances before ExecNodeOptions instances. In any case, we
>> > think
>> > > > there should be appropriate field names defined for each
>> relation/node.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > Note that due to this issue, any Arrow execution node that
>> prompts
>> > > the
>> > > > user to refer to fields by name might not play nicely with
>> Substrait.
>> > > > > >
>> > > > > >
>> > > > > > Cheers,
>> > > > > > Yaron.
>> > > >
>> > >
>> >
>>
>

Re: [C++] output field names in Arrow Substrait

Reply via email to