[C++] output field names in Arrow Substrait

Yaron Gvili Tue, 19 Apr 2022 09:10:28 -0700

Hi,


We ran into an issue due to the fact that, for intermediate relations, 
Substrait does not automatically compute output field names nor allows one to 
explicitly name output fields [1]. This leads to trouble when one needs to 
refer to these output fields by name [2]. We run into this trouble when 
deserializing in Arrow [3] (based on commit 8e13c2dd).



One use case where [1] occurs is:

import ibis

import pyarrow as pa

loaded_table = ibis.local_table(...)

cast_table = ibis.mutate(loaded_table, time=loaded_table.time.cast(pa.int64())

Even if loaded_table has nicely named fields, like "time" and "value1" and 
"value2", the resulting cast_table (when deserialized in Arrow) has 
inconvenient field names like "FieldPath(1)" and "FieldPath(2)" for all but the 
"time" field. I believe that Substrait has all the information to automatically 
compute the field name "value*" for cast_table, but it doesn't do this. 
Moreover, the caller has to know the schema of loaded_table in order to set the 
output field names for cast_table, and this is not convenient. When cast_table 
is an intermediate relation (i.e., not the root relation of the plan), 
Substrait also doesn't allow the caller to explicitly name the output fields, 
and there is no place for these field names in the Substrait protobuf.



One use case where [2] occurs is in a join-type operation over cast_table. The 
caller would normally like to use a field name (and not an index) to refer to 
the join-key. Even if the caller knows the field name for the join-key in 
loaded_table, many field names in cast_table (when deserialized in Arrow) are 
different (and each includes an index in it) than those of loaded_table.



The case of [3] occurs because Arrow deserializes ExecNodeOption instances 
before it has Schema instances at hand. At this stage, without schemata, Arrow 
cannot compute field names that should be placed in a ExecNodeOption that needs 
them, in particular ProjectNodeOptions. Currently, Arrow creates an expression 
vector for the ProjectNodeOptions instance but leaves the names vector empty, 
and later defaults each name to the ToString of each expression.



We'd like your input on this issue. Currently, we see two viable ways to 
resolve this issue: (1) add output field name for intermediate relations in the 
Substrait plan so that Arrow can directly access them, or (2) reproduce the 
field names in Arrow during deserialization by creating Schema instances before 
ExecNodeOptions instances. In any case, we think there should be appropriate 
field names defined for each relation/node.



Note that due to this issue, any Arrow execution node that prompts the user to 
refer to fields by name might not play nicely with Substrait.


Cheers,
Yaron.

[C++] output field names in Arrow Substrait

Reply via email to