> I don't think Substratit cannot convince Spark to go index based. Sorry I meant " I don't think Substrait can convince Spark to go index based"
On Wed, Apr 20, 2022 at 11:24 AM Li Jin <ice.xell...@gmail.com> wrote: > > A third alternative might be to adjust Arrow to move to field-index > based > execution instead of storing schemas in its relational operators. > > This is an interesting idea - I do think there are operators that need > column name information (e.g. a "checkpoint" node that writes to external > storage). In this case, it sounds like any operator that requires knowing > the column name needs to store additional information for the column names > in addition to the field index, in its substrait representation. > > > The names at the relation root are a different story, since they specify > the column names for the produced table, but there's already a mechanism > for specifying those. > > I feel this is a bit limited because I don't think relation roots are the > only thing that produces tables. I can think of two examples in Spark that > produce tables and are not a relation root. (1) A checkpoint node that > produces a table to write to external storage (2) A Python UDF node that > materializes a slice of the table and produces a pd.DataFrame to pass down > to a user function. I think we will see these cases in Arrow as well. I > feel it might make sense to have a subclass of the operator that requires > column names in addition to just root rel. > > *On schema-based vs index-based**:* > > From personal experience working on Spark it would say that having schema > available in each operator is useful for developers - I often print > input/output rows and I feel not having the column names available might > put a tax on the developer. (But maybe there are other ways to solve this). > > Are there other substrait consumers at the moment and how do they deal > with this? For example, if we were to have Spark consumes substrait I'd > imagine there would be similar problems because schema is part of Spark > operators too and I don't think Substratit cannot convince Spark to go > index based. > I'd imagine during deserialization of the subtrait plan to Spark it would > need to compute a schema object for each operator. That's why I think > computing schema during serialization might be a better solution than going > for index based in Arrow. > > > On Wed, Apr 20, 2022 at 11:14 AM Phillip Cloud <cpcl...@gmail.com> wrote: > >> On Wed, Apr 20, 2022, 11:31 Jeroen van Straten < >> jeroen.van.stra...@gmail.com> >> wrote: >> >> > From a Substrait perspective, it would be up to Ibis to convert the >> > column names to the correct indices. This can and should be transparent >> > to the end user; front-ends already need to know what the schemas are >> > in order to produce a valid Substrait plan, so Ibis should have all the >> > information it needs to do this conversion. >> > >> > I would, however, agree that it's inconsistent for Substrait to require >> > that column names are specified for read relations, but not for the >> > other relations. AFAICT nothing on the consumer side can or should make >> > use of those names. The only use case I can think of is to aid >> > debugging; if that's the only reason they exist, it should indeed be >> > possible to specify the names for each relation, but it should be in >> > the form of as optional, non-functional hints in RelCommon. The names >> > at the relation root are a different story, since they specify the >> > column names for the produced table, but there's already a mechanism >> > for specifying those. No idea if Arrow and Ibis use it (since it's >> > optional), but Substrait supports it. >> > >> >> ReadRels having column names is probably not strictly necessary for all >> the >> ReadRels, but for NamedTable it's necessary. For example, if I wanted to >> represent a ReadRel independent of storage, that an API could deserialize >> for metadata, and present that metadata to the user then you need column >> names. Ibis uses the column names from ReadRels to turn Substrait back >> into >> expressions. >> >> The RelRoot is very much used in ibis. Every table expression compiles to >> a >> RelRoot. >> >> >> > If there *are* special cases I'm not seeing, where the execution engine >> > really does need the names of intermediate columns, I suppose it would >> > be up to the engine to figure out what they would be. The information >> > could also be added via Substrait "advanced extensions," especially >> > when a user overrides a name, but it wouldn't be a good idea for the >> > consumer to rely on the existance of these extensions, for >> > compatibility with other frameworks. >> >> >> > If said special cases turn out to not be so special, a valid case could >> > be made to pull that extension into Substrait itself. If they're only >> > attached wherever the user manually overrides a name, I think the size >> > explosion Phillip is talking about would be avoided. I *really* don't >> > see why anyone would want to match *generated* names in any functional >> > way; that's a recipe for undefined behavior. >> > >> > None of Substrait's built-in things make use of column names though, >> > which implies that any Arrow execution node that *does* need it cannot >> > be described with Substrait, and thus should never appear in an >> > execution plan deserialized from Substrait. If it currently does, >> > that's an issue with Arrow's Substrait consumer. If we'd want to >> > correctly represent those kinds of nodes anyway, we'd already be >> > dealing with an extension relation, to which we could attach whatever >> > information we want. >> > >> > On Wed, 20 Apr 2022 at 16:05, Phillip Cloud <cpcl...@gmail.com> wrote: >> > >> > > A third alternative might be to adjust Arrow to move to field-index >> based >> > > execution instead of storing schemas in its relational operators. >> > > >> > > I suspect that deserializing the schema on the arrow side is the right >> > > solution in at least the near term. Repeating field names where they >> > aren't >> > > strictly necessary has a tendency to enlarge plans, and the effect >> gets >> > > worse as plans go through transformations. >> > > >> > > On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <rt...@hotmail.com> >> wrote: >> > > >> > > > Hi Weston, >> > > > >> > > > To reiterate, the scenario is not a user writing a Substrait plan by >> > > hand, >> > > > but a user writing an Ibis expression and the remaining steps - >> > Substrait >> > > > compilation, serialization to protobuf, deserialization to an Arrow >> > plan, >> > > > and execution of the plan - are done systematically. This issue is >> > > related >> > > > to Ibis, Ibis-Substrait, and Arrow together, so multiple persons may >> > need >> > > > to bear on this issue. >> > > > Index based field references should still work here. For example, >> if >> > > > I have tables: >> > > > >> > > > Key | LeftPayload -- Key | RightPayload >> > > > >> > > > The join expression would be field(0) == field(2) >> > > > Right, but the point in this use case is that it would not be >> > convenient >> > > > for a user writing an Ibis expression to specify each key of each >> > table, >> > > > especially when plenty of (say, 100) tables are being joined. It >> would >> > be >> > > > much more convenient for the user to specify once the name of the >> key >> > > that >> > > > exists in all tables. This is a use case where the key should (not >> > must) >> > > be >> > > > specified by name for convenience. >> > > > In both of those cases the field names are not part of the plan >> itself. >> > > > In the second use case, the user writing the ibis expression is >> > > specifying >> > > > a string-name as a parameter to a relation that would later get >> passed >> > to >> > > > an Arrow execution node in an options instance and used to >> dynamically >> > > set >> > > > up field-names. This string-name does in fact appear explicitly in >> the >> > > > Substrait plan, not for convenience but of necessity. What does not >> > > appear >> > > > in the Substrait plan are output field-names of intermediate >> relations; >> > > > this in itself is not a problem, because these field-names can (in >> > > > principle) be recomputed to allow them to be matched to the >> dynamically >> > > > set-up field-names. But the current implementation of Arrow (in its >> > > > Substrait module) does not do this because the schemata are not >> > available >> > > > during deserialization, when the option instances (such as >> > > > ProjectNodeOptions) that require the schemata are created. Instead, >> > Arrow >> > > > (in its Substrait module) ends up with non-natural field-names like >> > > > "FieldPath(1)" that fail to be matched. >> > > > >> > > > That's why this is not a Substrait specific problem, but one that is >> > > > related to Ibis, Ibis-Substrait, and Arrow together. We think it >> can be >> > > > resolved either by changing Substrait to assist Arrow by specifying >> the >> > > > field-names of intermediate relations in the plan, so that they are >> > > readily >> > > > available during deserialization, or by changing Arrow (in its >> > Substrait >> > > > module) to reproduce the field-names during deserialization, like by >> > > > computing the schemata before options and node instances are >> created. >> > We >> > > > might come up with other solutions in this discussion, of course. >> > Either >> > > > way, for this second use case, the field-names of intermediate >> > relations >> > > > must be natural; they cannot be left as something like >> "FieldPath(1)". >> > > > >> > > > >> > > > Yaron. >> > > > ________________________________ >> > > > From: Weston Pace <weston.p...@gmail.com> >> > > > Sent: Tuesday, April 19, 2022 7:12 PM >> > > > To: dev@arrow.apache.org <dev@arrow.apache.org> >> > > > Cc: Li Jin <ice.xell...@gmail.com> >> > > > Subject: Re: [C++] output field names in Arrow Substrait >> > > > >> > > > > However, the problem is there are natural cases in >> > > > > which an execution node should or must take in a string-name >> > > > >> > > > If we can come up with such a case then I agree it would be a >> problem >> > > > for Substrait's current definition. I don't think we can come up >> with >> > > > such a case. Every column that can be referenced by name has a >> unique >> > > > index we could use instead. >> > > > >> > > > > One use case is an execution node that is joining N >> > > > > input tables on a column-name that exists in all of them. >> > > > >> > > > Index based field references should still work here. For example, >> if >> > > > I have tables: >> > > > >> > > > Key | LeftPayload -- Key | RightPayload >> > > > >> > > > The join expression would be field(0) == field(2) >> > > > >> > > > > The execution node would make itself more convenient >> > > > > for the user by allowing specifying a string-name than by >> > > > > specifying N FieldRef instances (each with the same >> > > > > name but on a different input table) like the above >> > > > > requirement would force >> > > > >> > > > Substrait plans aren't normally created by humans. I'm not sure >> > > > convenience is a factor here. >> > > > >> > > > > Another use case for a string-name is when it is used within >> > > > > the execution node to dynamically set up field-names, e.g., a >> > > > > node that operates on the input's columns whose name starts >> > > > > with the given string-name or a node that operates on an input >> > > > > column whose name is given as data in another input column. >> > > > >> > > > In both of those cases the field names are not part of the plan >> itself. >> > > > >> > > > On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <rt...@hotmail.com> >> wrote: >> > > > > >> > > > > Hi Weston, >> > > > > >> > > > > Thanks for the quick response. >> > > > > I think you might have forgotten the links for [1][2][3] >> > > > > Sorry about the confusion; I use these not as references to links >> but >> > > as >> > > > markers of points I make in the beginning that I elaborate on >> later, in >> > > the >> > > > places where I reuse the markers. >> > > > > Are you going from Substrait to an Arrow execution plan? >> > > > > Yes. >> > > > > For Substrait -> Arrow most of our execution nodes should take in >> a >> > > > FieldRef which can be a name but can also be index-based. So I >> wouldn't >> > > > expect Substrait's exclusive use of index-based references to be an >> > > issue. >> > > > > Yes, I agree this is a design feature of Substrait and that if the >> > > > execution node takes in a FieldRef then there won't be a problem >> with >> > > > Substrait use of an index-based reference for it. If any >> > to-be-developed >> > > > execution node must indeed use a FieldRef when taking in a parameter >> > that >> > > > refers to columns, then this requirement should be documented for >> > > > developers of execution nodes. >> > > > > >> > > > > However, the problem is there are natural cases in which an >> execution >> > > > node should or must take in a string-name; these suffer from the >> above >> > > > design feature due to the non-natural field names that it leads to. >> One >> > > use >> > > > case is an execution node that is joining N input tables on a >> > column-name >> > > > that exists in all of them. The execution node would make itself >> more >> > > > convenient for the user by allowing specifying a string-name than by >> > > > specifying N FieldRef instances (each with the same name but on a >> > > different >> > > > input table) like the above requirement would force. Another use >> case >> > > for a >> > > > string-name is when it is used within the execution node to >> dynamically >> > > set >> > > > up field-names, e.g., a node that operates on the input's columns >> whose >> > > > name starts with the given string-name or a node that operates on an >> > > input >> > > > column whose name is given as data in another input column. >> > > > > >> > > > > This is the main reason we think there should be appropriate >> > (natural) >> > > > field names defined for each relation/node. >> > > > > Also keep in mind that Substrait isn't exactly a query language >> for >> > > > user's to be typing by hand. >> > > > > Yes, I took that into consideration. Namely, the above discussion >> > > refers >> > > > to the scenario of a user writing an Ibis expression and the >> remaining >> > > > steps - Substrait compilation, serialization to protobuf,, >> > > deserialization >> > > > to an Arrow plan, and execution of the plan - are done >> systematically. >> > > > > >> > > > > >> > > > > Yaron. >> > > > > ________________________________ >> > > > > From: Weston Pace <weston.p...@gmail.com> >> > > > > Sent: Tuesday, April 19, 2022 1:01 PM >> > > > > To: dev@arrow.apache.org <dev@arrow.apache.org> >> > > > > Cc: Li Jin <ice.xell...@gmail.com> >> > > > > Subject: Re: [C++] output field names in Arrow Substrait >> > > > > >> > > > > Hi Yaron, I think you might have forgotten the links for >> [1][2][3] so >> > > > > I'm not entirely sure of the context. Are you going from >> Substrait >> > to >> > > > > an Arrow execution plan? Or are you going from an Arrow execution >> > > > > plan to Substrait? >> > > > > >> > > > > For Substrait -> Arrow most of our execution nodes should take in >> a >> > > > > FieldRef which can be a name but can also be index-based. So I >> > > > > wouldn't expect Substrait's exclusive use of index-based >> references >> > to >> > > > > be an issue. Also keep in mind that Substrait isn't exactly a >> query >> > > > > language for user's to be typing by hand. So, for example, if a >> user >> > > > > wants a join and is using Ibis they could type: >> > > > > >> > > > > table.semi_join(s, table.x == table.y) >> > > > > >> > > > > Ibis is then responsible for converting "x" and "y" to the >> > appropriate >> > > > > indices (and it should have all the information needed to do so). >> > The >> > > > > Substrait plan will refer to these nodes by index and the >> > > > > corresponding Arrow execution plan will use integer-based >> FieldRef. >> > > > > >> > > > > For Arrow -> Substrait then I agree that this could become a >> problem. >> > > > > Right now the Arrow -> Substrait path is mainly used for internal >> > > > > testing with the hope that Substrait -> Arrow -> Substrait should >> > > > > generally be possible (with zero loss of information). This does >> not >> > > > > mean that every Arrow plan will be convertible into Substrait. >> That >> > > > > is certainly a potential goal, and PRs to add that capability >> would >> > be >> > > > > welcome, but I don't know if anyone working on the Arrow/Substrait >> > > > > integration has that goal in mind. If that is your goal I might >> be >> > > > > curious to learn more about your use cases. >> > > > > >> > > > > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <rt...@hotmail.com> >> > wrote: >> > > > > > >> > > > > > Hi, >> > > > > > >> > > > > > >> > > > > > We ran into an issue due to the fact that, for intermediate >> > > relations, >> > > > Substrait does not automatically compute output field names nor >> allows >> > > one >> > > > to explicitly name output fields [1]. This leads to trouble when one >> > > needs >> > > > to refer to these output fields by name [2]. We run into this >> trouble >> > > when >> > > > deserializing in Arrow [3] (based on commit 8e13c2dd). >> > > > > > >> > > > > > >> > > > > > >> > > > > > One use case where [1] occurs is: >> > > > > > >> > > > > > import ibis >> > > > > > >> > > > > > import pyarrow as pa >> > > > > > >> > > > > > loaded_table = ibis.local_table(...) >> > > > > > >> > > > > > cast_table = ibis.mutate(loaded_table, >> > > > time=loaded_table.time.cast(pa.int64()) >> > > > > > >> > > > > > Even if loaded_table has nicely named fields, like "time" and >> > > "value1" >> > > > and "value2", the resulting cast_table (when deserialized in Arrow) >> has >> > > > inconvenient field names like "FieldPath(1)" and "FieldPath(2)" for >> all >> > > but >> > > > the "time" field. I believe that Substrait has all the information >> to >> > > > automatically compute the field name "value*" for cast_table, but it >> > > > doesn't do this. Moreover, the caller has to know the schema of >> > > > loaded_table in order to set the output field names for cast_table, >> and >> > > > this is not convenient. When cast_table is an intermediate relation >> > > (i.e., >> > > > not the root relation of the plan), Substrait also doesn't allow the >> > > caller >> > > > to explicitly name the output fields, and there is no place for >> these >> > > field >> > > > names in the Substrait protobuf. >> > > > > > >> > > > > > >> > > > > > >> > > > > > One use case where [2] occurs is in a join-type operation over >> > > > cast_table. The caller would normally like to use a field name (and >> not >> > > an >> > > > index) to refer to the join-key. Even if the caller knows the field >> > name >> > > > for the join-key in loaded_table, many field names in cast_table >> (when >> > > > deserialized in Arrow) are different (and each includes an index in >> it) >> > > > than those of loaded_table. >> > > > > > >> > > > > > >> > > > > > >> > > > > > The case of [3] occurs because Arrow deserializes ExecNodeOption >> > > > instances before it has Schema instances at hand. At this stage, >> > without >> > > > schemata, Arrow cannot compute field names that should be placed in >> a >> > > > ExecNodeOption that needs them, in particular ProjectNodeOptions. >> > > > Currently, Arrow creates an expression vector for the >> > ProjectNodeOptions >> > > > instance but leaves the names vector empty, and later defaults each >> > name >> > > to >> > > > the ToString of each expression. >> > > > > > >> > > > > > >> > > > > > >> > > > > > We'd like your input on this issue. Currently, we see two viable >> > ways >> > > > to resolve this issue: (1) add output field name for intermediate >> > > relations >> > > > in the Substrait plan so that Arrow can directly access them, or (2) >> > > > reproduce the field names in Arrow during deserialization by >> creating >> > > > Schema instances before ExecNodeOptions instances. In any case, we >> > think >> > > > there should be appropriate field names defined for each >> relation/node. >> > > > > > >> > > > > > >> > > > > > >> > > > > > Note that due to this issue, any Arrow execution node that >> prompts >> > > the >> > > > user to refer to fields by name might not play nicely with >> Substrait. >> > > > > > >> > > > > > >> > > > > > Cheers, >> > > > > > Yaron. >> > > > >> > > >> > >> >