Re: [C++] output field names in Arrow Substrait

Li Jin Wed, 20 Apr 2022 08:42:13 -0700

Sorry for the many typos :(

> That's why I think computing schema during serialization might be a
better solution than going for index based in Arrow.


To clarify, by this I meant keeping the substrait representation and in
Arrow consumer, compute/construct schema/column names from subtrait field
indices.

On Wed, Apr 20, 2022 at 11:25 AM Li Jin <ice.xell...@gmail.com> wrote:

> >  I don't think Substratit cannot convince Spark to go index based.
> Sorry I meant " I don't think Substrait can convince Spark to go index
> based"
>
> On Wed, Apr 20, 2022 at 11:24 AM Li Jin <ice.xell...@gmail.com> wrote:
>
>> >  A third alternative might be to adjust Arrow to move to field-index
>> based
>> execution instead of storing schemas in its relational operators.
>>
>> This is an interesting idea - I do think there are operators that need
>> column name information (e.g. a "checkpoint" node that writes to external
>> storage). In this case, it sounds like any operator that requires knowing
>> the column name needs to store additional information for the column names
>> in addition to the field index, in its substrait representation.
>>
>> > The names at the relation root are a different story, since they
>> specify the column names for the produced table, but there's already a
>> mechanism for specifying those.
>>
>> I feel this is a bit limited because I don't think relation roots are the
>> only thing that produces tables. I can think of two examples in Spark that
>> produce tables and are not a relation root. (1) A checkpoint node that
>> produces a table to write to external storage (2) A Python UDF node that
>> materializes a slice of the table and produces a pd.DataFrame to pass down
>> to a user function. I think we will see these cases in Arrow as well. I
>> feel it might make sense to have a subclass of the operator that requires
>> column names in addition to just root rel.
>>
>> *On schema-based vs index-based**:*
>>
>> From personal experience working on Spark it would say that having schema
>> available in each operator is useful for developers - I often print
>> input/output rows and I feel not having the column names available might
>> put a tax on the developer. (But maybe there are other ways to solve this).
>>
>> Are there other substrait consumers at the moment and how do they deal
>> with this? For example, if we were to have Spark consumes substrait I'd
>> imagine there would be similar problems because schema is part of Spark
>> operators too and I don't think Substratit cannot convince Spark to go
>> index based.
>> I'd imagine during deserialization of the subtrait plan to Spark it would
>> need to compute a schema object for each operator. That's why I think
>> computing schema during serialization might be a better solution than going
>> for index based in Arrow.
>>
>>
>> On Wed, Apr 20, 2022 at 11:14 AM Phillip Cloud <cpcl...@gmail.com> wrote:
>>
>>> On Wed, Apr 20, 2022, 11:31 Jeroen van Straten <
>>> jeroen.van.stra...@gmail.com>
>>> wrote:
>>>
>>> > From a Substrait perspective, it would be up to Ibis to convert the
>>> > column names to the correct indices. This can and should be transparent
>>> > to the end user; front-ends already need to know what the schemas are
>>> > in order to produce a valid Substrait plan, so Ibis should have all the
>>> > information it needs to do this conversion.
>>> >
>>> > I would, however, agree that it's inconsistent for Substrait to require
>>> > that column names are specified for read relations, but not for the
>>> > other relations. AFAICT nothing on the consumer side can or should make
>>> > use of those names. The only use case I can think of is to aid
>>> > debugging; if that's the only reason they exist, it should indeed be
>>> > possible to specify the names for each relation, but it should be in
>>> > the form of as optional, non-functional hints in RelCommon. The names
>>> > at the relation root are a different story, since they specify the
>>> > column names for the produced table, but there's already a mechanism
>>> > for specifying those. No idea if Arrow and Ibis use it (since it's
>>> > optional), but Substrait supports it.
>>> >
>>>
>>> ReadRels having column names is probably not strictly necessary for all
>>> the
>>> ReadRels, but for NamedTable it's necessary. For example, if I wanted to
>>> represent a ReadRel independent of storage, that an API could deserialize
>>> for metadata, and present that metadata to the user then you need column
>>> names. Ibis uses the column names from ReadRels to turn Substrait back
>>> into
>>> expressions.
>>>
>>> The RelRoot is very much used in ibis. Every table expression compiles
>>> to a
>>> RelRoot.
>>>
>>>
>>> > If there *are* special cases I'm not seeing, where the execution engine
>>> > really does need the names of intermediate columns, I suppose it would
>>> > be up to the engine to figure out what they would be. The information
>>> > could also be added via Substrait "advanced extensions," especially
>>> > when a user overrides a name, but it wouldn't be a good idea for the
>>> > consumer to rely on the existance of these extensions, for
>>> > compatibility with other frameworks.
>>>
>>>
>>> > If said special cases turn out to not be so special, a valid case could
>>> > be made to pull that extension into Substrait itself. If they're only
>>> > attached wherever the user manually overrides a name, I think the size
>>> > explosion Phillip is talking about would be avoided. I *really* don't
>>> > see why anyone would want to match *generated* names in any functional
>>> > way; that's a recipe for undefined behavior.
>>> >
>>> > None of Substrait's built-in things make use of column names though,
>>> > which implies that any Arrow execution node that *does* need it cannot
>>> > be described with Substrait, and thus should never appear in an
>>> > execution plan deserialized from Substrait. If it currently does,
>>> > that's an issue with Arrow's Substrait consumer. If we'd want to
>>> > correctly represent those kinds of nodes anyway, we'd already be
>>> > dealing with an extension relation, to which we could attach whatever
>>> > information we want.
>>> >
>>> > On Wed, 20 Apr 2022 at 16:05, Phillip Cloud <cpcl...@gmail.com> wrote:
>>> >
>>> > > A third alternative might be to adjust Arrow to move to field-index
>>> based
>>> > > execution instead of storing schemas in its relational operators.
>>> > >
>>> > > I suspect that deserializing the schema on the arrow side is the
>>> right
>>> > > solution in at least the near term. Repeating field names where they
>>> > aren't
>>> > > strictly necessary has a tendency to enlarge plans, and the effect
>>> gets
>>> > > worse as plans go through transformations.
>>> > >
>>> > > On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <rt...@hotmail.com>
>>> wrote:
>>> > >
>>> > > > Hi Weston,
>>> > > >
>>> > > > To reiterate, the scenario is not a user writing a Substrait plan
>>> by
>>> > > hand,
>>> > > > but a user writing an Ibis expression and the remaining steps -
>>> > Substrait
>>> > > > compilation, serialization to protobuf, deserialization to an Arrow
>>> > plan,
>>> > > > and execution of the plan - are done systematically. This issue is
>>> > > related
>>> > > > to Ibis, Ibis-Substrait, and Arrow together, so multiple persons
>>> may
>>> > need
>>> > > > to bear on this issue.
>>> > > > Index based field references should still work here.  For example,
>>> if
>>> > > > I have tables:
>>> > > >
>>> > > > Key | LeftPayload    --    Key | RightPayload
>>> > > >
>>> > > > The join expression would be field(0) == field(2)
>>> > > > Right, but the point in this use case is that it would not be
>>> > convenient
>>> > > > for a user writing an Ibis expression to specify each key of each
>>> > table,
>>> > > > especially when plenty of (say, 100) tables are being joined. It
>>> would
>>> > be
>>> > > > much more convenient for the user to specify once the name of the
>>> key
>>> > > that
>>> > > > exists in all tables. This is a use case where the key should (not
>>> > must)
>>> > > be
>>> > > > specified by name for convenience.
>>> > > > In both of those cases the field names are not part of the plan
>>> itself.
>>> > > > In the second use case, the user writing the ibis expression is
>>> > > specifying
>>> > > > a string-name as a parameter to a relation that would later get
>>> passed
>>> > to
>>> > > > an Arrow execution node in an options instance and used to
>>> dynamically
>>> > > set
>>> > > > up field-names. This string-name does in fact appear explicitly in
>>> the
>>> > > > Substrait plan, not for convenience but of necessity. What does not
>>> > > appear
>>> > > > in the Substrait plan are output field-names of intermediate
>>> relations;
>>> > > > this in itself is not a problem, because these field-names can (in
>>> > > > principle) be recomputed to allow them to be matched to the
>>> dynamically
>>> > > > set-up field-names. But the current implementation of Arrow (in its
>>> > > > Substrait module) does not do this because the schemata are not
>>> > available
>>> > > > during deserialization, when the option instances (such as
>>> > > > ProjectNodeOptions) that require the schemata are created. Instead,
>>> > Arrow
>>> > > > (in its Substrait module) ends up with non-natural field-names like
>>> > > > "FieldPath(1)" that fail to be matched.
>>> > > >
>>> > > > That's why this is not a Substrait specific problem, but one that
>>> is
>>> > > > related to Ibis, Ibis-Substrait, and Arrow together. We think it
>>> can be
>>> > > > resolved either by changing Substrait to assist Arrow by
>>> specifying the
>>> > > > field-names of intermediate relations in the plan, so that they are
>>> > > readily
>>> > > > available during deserialization, or by changing Arrow (in its
>>> > Substrait
>>> > > > module) to reproduce the field-names during deserialization, like
>>> by
>>> > > > computing the schemata before options and node instances are
>>> created.
>>> > We
>>> > > > might come up with other solutions in this discussion, of course.
>>> > Either
>>> > > > way, for this second use case, the field-names of intermediate
>>> > relations
>>> > > > must be natural; they cannot be left as something like
>>> "FieldPath(1)".
>>> > > >
>>> > > >
>>> > > > Yaron.
>>> > > > ________________________________
>>> > > > From: Weston Pace <weston.p...@gmail.com>
>>> > > > Sent: Tuesday, April 19, 2022 7:12 PM
>>> > > > To: dev@arrow.apache.org <dev@arrow.apache.org>
>>> > > > Cc: Li Jin <ice.xell...@gmail.com>
>>> > > > Subject: Re: [C++] output field names in Arrow Substrait
>>> > > >
>>> > > > > However, the problem is there are natural cases in
>>> > > > > which an execution node should or must take in a string-name
>>> > > >
>>> > > > If we can come up with such a case then I agree it would be a
>>> problem
>>> > > > for Substrait's current definition.  I don't think we can come up
>>> with
>>> > > > such a case.  Every column that can be referenced by name has a
>>> unique
>>> > > > index we could use instead.
>>> > > >
>>> > > > > One use case is an execution node that is joining N
>>> > > > > input tables on a column-name that exists in all of them.
>>> > > >
>>> > > > Index based field references should still work here.  For example,
>>> if
>>> > > > I have tables:
>>> > > >
>>> > > > Key | LeftPayload    --    Key | RightPayload
>>> > > >
>>> > > > The join expression would be field(0) == field(2)
>>> > > >
>>> > > > > The execution node would make itself more convenient
>>> > > > > for the user by allowing specifying a string-name than by
>>> > > > > specifying N FieldRef instances (each with the same
>>> > > > > name but on a different input table) like the above
>>> > > > > requirement would force
>>> > > >
>>> > > > Substrait plans aren't normally created by humans.  I'm not sure
>>> > > > convenience is a factor here.
>>> > > >
>>> > > > > Another use case for a string-name is when it is used within
>>> > > > > the execution node to dynamically set up field-names, e.g., a
>>> > > > > node that operates on the input's columns whose name starts
>>> > > > > with the given string-name or a node that operates on an input
>>> > > > > column whose name is given as data in another input column.
>>> > > >
>>> > > > In both of those cases the field names are not part of the plan
>>> itself.
>>> > > >
>>> > > > On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <rt...@hotmail.com>
>>> wrote:
>>> > > > >
>>> > > > > Hi Weston,
>>> > > > >
>>> > > > > Thanks for the quick response.
>>> > > > > I think you might have forgotten the links for [1][2][3]
>>> > > > > Sorry about the confusion; I use these not as references to
>>> links but
>>> > > as
>>> > > > markers of points I make in the beginning that I elaborate on
>>> later, in
>>> > > the
>>> > > > places where I reuse the markers.
>>> > > > > Are you going from Substrait to an Arrow execution plan?
>>> > > > > Yes.
>>> > > > > For Substrait -> Arrow most of our execution nodes should take
>>> in a
>>> > > > FieldRef which can be a name but can also be index-based. So I
>>> wouldn't
>>> > > > expect Substrait's exclusive use of index-based references to be an
>>> > > issue.
>>> > > > > Yes, I agree this is a design feature of Substrait and that if
>>> the
>>> > > > execution node takes in a FieldRef then there won't be a problem
>>> with
>>> > > > Substrait use of an index-based reference for it. If any
>>> > to-be-developed
>>> > > > execution node must indeed use a FieldRef when taking in a
>>> parameter
>>> > that
>>> > > > refers to columns, then this requirement should be documented for
>>> > > > developers of execution nodes.
>>> > > > >
>>> > > > > However, the problem is there are natural cases in which an
>>> execution
>>> > > > node should or must take in a string-name; these suffer from the
>>> above
>>> > > > design feature due to the non-natural field names that it leads
>>> to. One
>>> > > use
>>> > > > case is an execution node that is joining N input tables on a
>>> > column-name
>>> > > > that exists in all of them. The execution node would make itself
>>> more
>>> > > > convenient for the user by allowing specifying a string-name than
>>> by
>>> > > > specifying N FieldRef instances (each with the same name but on a
>>> > > different
>>> > > > input table) like the above requirement would force. Another use
>>> case
>>> > > for a
>>> > > > string-name is when it is used within the execution node to
>>> dynamically
>>> > > set
>>> > > > up field-names, e.g., a node that operates on the input's columns
>>> whose
>>> > > > name starts with the given string-name or a node that operates on
>>> an
>>> > > input
>>> > > > column whose name is given as data in another input column.
>>> > > > >
>>> > > > > This is the main reason we think there should be appropriate
>>> > (natural)
>>> > > > field names defined for each relation/node.
>>> > > > > Also keep in mind that Substrait isn't exactly a query language
>>> for
>>> > > > user's to be typing by hand.
>>> > > > > Yes, I took that into consideration. Namely, the above discussion
>>> > > refers
>>> > > > to the scenario of a user writing an Ibis expression and the
>>> remaining
>>> > > > steps - Substrait compilation, serialization to protobuf,,
>>> > > deserialization
>>> > > > to an Arrow plan, and execution of the plan - are done
>>> systematically.
>>> > > > >
>>> > > > >
>>> > > > > Yaron.
>>> > > > > ________________________________
>>> > > > > From: Weston Pace <weston.p...@gmail.com>
>>> > > > > Sent: Tuesday, April 19, 2022 1:01 PM
>>> > > > > To: dev@arrow.apache.org <dev@arrow.apache.org>
>>> > > > > Cc: Li Jin <ice.xell...@gmail.com>
>>> > > > > Subject: Re: [C++] output field names in Arrow Substrait
>>> > > > >
>>> > > > > Hi Yaron, I think you might have forgotten the links for
>>> [1][2][3] so
>>> > > > > I'm not entirely sure of the context.  Are you going from
>>> Substrait
>>> > to
>>> > > > > an Arrow execution plan?  Or are you going from an Arrow
>>> execution
>>> > > > > plan to Substrait?
>>> > > > >
>>> > > > > For Substrait -> Arrow most of our execution nodes should take
>>> in a
>>> > > > > FieldRef which can be a name but can also be index-based.  So I
>>> > > > > wouldn't expect Substrait's exclusive use of index-based
>>> references
>>> > to
>>> > > > > be an issue.  Also keep in mind that Substrait isn't exactly a
>>> query
>>> > > > > language for user's to be typing by hand.  So, for example, if a
>>> user
>>> > > > > wants a join and is using Ibis they could type:
>>> > > > >
>>> > > > > table.semi_join(s, table.x == table.y)
>>> > > > >
>>> > > > > Ibis is then responsible for converting "x" and "y" to the
>>> > appropriate
>>> > > > > indices (and it should have all the information needed to do so).
>>> > The
>>> > > > > Substrait plan will refer to these nodes by index and the
>>> > > > > corresponding Arrow execution plan will use integer-based
>>> FieldRef.
>>> > > > >
>>> > > > > For Arrow -> Substrait then I agree that this could become a
>>> problem.
>>> > > > > Right now the Arrow -> Substrait path is mainly used for internal
>>> > > > > testing with the hope that Substrait -> Arrow -> Substrait should
>>> > > > > generally be possible (with zero loss of information).  This
>>> does not
>>> > > > > mean that every Arrow plan will be convertible into Substrait.
>>> That
>>> > > > > is certainly a potential goal, and PRs to add that capability
>>> would
>>> > be
>>> > > > > welcome, but I don't know if anyone working on the
>>> Arrow/Substrait
>>> > > > > integration has that goal in mind.  If that is your goal I might
>>> be
>>> > > > > curious to learn more about your use cases.
>>> > > > >
>>> > > > > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <rt...@hotmail.com>
>>> > wrote:
>>> > > > > >
>>> > > > > > Hi,
>>> > > > > >
>>> > > > > >
>>> > > > > > We ran into an issue due to the fact that, for intermediate
>>> > > relations,
>>> > > > Substrait does not automatically compute output field names nor
>>> allows
>>> > > one
>>> > > > to explicitly name output fields [1]. This leads to trouble when
>>> one
>>> > > needs
>>> > > > to refer to these output fields by name [2]. We run into this
>>> trouble
>>> > > when
>>> > > > deserializing in Arrow [3] (based on commit 8e13c2dd).
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > One use case where [1] occurs is:
>>> > > > > >
>>> > > > > > import ibis
>>> > > > > >
>>> > > > > > import pyarrow as pa
>>> > > > > >
>>> > > > > > loaded_table = ibis.local_table(...)
>>> > > > > >
>>> > > > > > cast_table = ibis.mutate(loaded_table,
>>> > > > time=loaded_table.time.cast(pa.int64())
>>> > > > > >
>>> > > > > > Even if loaded_table has nicely named fields, like "time" and
>>> > > "value1"
>>> > > > and "value2", the resulting cast_table (when deserialized in
>>> Arrow) has
>>> > > > inconvenient field names like "FieldPath(1)" and "FieldPath(2)"
>>> for all
>>> > > but
>>> > > > the "time" field. I believe that Substrait has all the information
>>> to
>>> > > > automatically compute the field name "value*" for cast_table, but
>>> it
>>> > > > doesn't do this. Moreover, the caller has to know the schema of
>>> > > > loaded_table in order to set the output field names for
>>> cast_table, and
>>> > > > this is not convenient. When cast_table is an intermediate relation
>>> > > (i.e.,
>>> > > > not the root relation of the plan), Substrait also doesn't allow
>>> the
>>> > > caller
>>> > > > to explicitly name the output fields, and there is no place for
>>> these
>>> > > field
>>> > > > names in the Substrait protobuf.
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > One use case where [2] occurs is in a join-type operation over
>>> > > > cast_table. The caller would normally like to use a field name
>>> (and not
>>> > > an
>>> > > > index) to refer to the join-key. Even if the caller knows the field
>>> > name
>>> > > > for the join-key in loaded_table, many field names in cast_table
>>> (when
>>> > > > deserialized in Arrow) are different (and each includes an index
>>> in it)
>>> > > > than those of loaded_table.
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > The case of [3] occurs because Arrow deserializes
>>> ExecNodeOption
>>> > > > instances before it has Schema instances at hand. At this stage,
>>> > without
>>> > > > schemata, Arrow cannot compute field names that should be placed
>>> in a
>>> > > > ExecNodeOption that needs them, in particular ProjectNodeOptions.
>>> > > > Currently, Arrow creates an expression vector for the
>>> > ProjectNodeOptions
>>> > > > instance but leaves the names vector empty, and later defaults each
>>> > name
>>> > > to
>>> > > > the ToString of each expression.
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > We'd like your input on this issue. Currently, we see two
>>> viable
>>> > ways
>>> > > > to resolve this issue: (1) add output field name for intermediate
>>> > > relations
>>> > > > in the Substrait plan so that Arrow can directly access them, or
>>> (2)
>>> > > > reproduce the field names in Arrow during deserialization by
>>> creating
>>> > > > Schema instances before ExecNodeOptions instances. In any case, we
>>> > think
>>> > > > there should be appropriate field names defined for each
>>> relation/node.
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > Note that due to this issue, any Arrow execution node that
>>> prompts
>>> > > the
>>> > > > user to refer to fields by name might not play nicely with
>>> Substrait.
>>> > > > > >
>>> > > > > >
>>> > > > > > Cheers,
>>> > > > > > Yaron.
>>> > > >
>>> > >
>>> >
>>>
>>

Re: [C++] output field names in Arrow Substrait

Reply via email to