Re: Integration between ibis-substrait and Acero

Will Jones Wed, 05 Oct 2022 13:27:15 -0700

Some of these issues may be fixed in ARROW-17915 [1].

[1] https://github.com/apache/arrow/pull/14295


On Wed, Oct 5, 2022 at 12:07 PM Will Jones <will.jones...@gmail.com> wrote:

> I can confirm that fixes that issue in the simple case. But if result is
> either of these, we get an error:
>
> result = compiler.compile(t.select("b"))
> # leads to:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query
>     c_reader = GetResultValue(c_res_reader)
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>     return check_status(status)
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>     raise ArrowInvalid(message)
> pyarrow.lib.ArrowInvalid: ConsumingSinkNode with mismatched number of names
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:84
>  plan_->StartProducing()
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:131
>  executor.Execute()
>
>
> result = compiler.compile(t.group_by("a").mutate(z = t.b.sum()))
> # leads to:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/_substrait.pyx", line 140, in pyarrow._substrait.run_query
>     c_reader = GetResultValue(c_res_reader)
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>     return check_status(status)
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>     raise ArrowInvalid(message)
> pyarrow.lib.ArrowInvalid: Invalid column index to add field.
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:338
>  project_schema->AddField( num_columns +
> static_cast<int>(project.expressions().size()) - 1,
> std::move(project_field))
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/serde.cc:156
>  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(),
> ext_set, conversion_options)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:106
>  engine::DeserializePlans(substrait_buffer, consumer_factory, registry,
> nullptr, conversion_options_)
> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:130
>  executor.Init(substrait_buffer, registry)
>
> On Wed, Oct 5, 2022 at 11:58 AM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Ok I think I got a working version now:
>>
>>         t = ibis.table([("a", "int64"), ("b", "int64")], name="table0")
>>
>>         test_table_0 = pa.Table.from_pydict({"a": [1, 2, 3], "b": [4, 5,
>> 6]})
>>
>>
>>
>>         result = self.compiler.compile(t)
>>
>>
>>
>>         def table_provider(names):
>>
>>             if not names:
>>
>>                 raise Exception("No names provided")
>>
>>             elif names[0] == 'table0':
>>
>>                 return test_table_0
>>
>>             else:
>>
>>                 raise Exception(f"Unknown table name {names}")
>>
>>
>>
>>         reader =
>> pa.substrait.run_query(pa.py_buffer(result.SerializeToString()),
>> table_provider)
>>
>>         result_table = reader.read_all()
>>
>>
>>
>>         self.assertTrue(result_table == test_table_0)
>>
>> First successful run with ibis/substrait/acero - Hooray
>>
>> On Wed, Oct 5, 2022 at 2:33 PM Li Jin <ice.xell...@gmail.com> wrote:
>>
>> > Hmm. Thanks for the update - Now I searched the code more, it seems
>> > perhaps I should be using "compile" rather than "translate";
>> >
>> >
>> >
>> https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/compiler/core.py#L82
>> >
>> > Let me try some more
>> >
>> > On Wed, Oct 5, 2022 at 1:42 PM Will Jones <will.jones...@gmail.com>
>> wrote:
>> >
>> >> Hi Li Jin,
>> >>
>> >> The original segfault seems to occur because you are passing a Python
>> >> bytes
>> >> object and not a PyArrow Buffer object. You can wrap the bytes object
>> >> using
>> >> pa.py_buffer():
>> >>
>> >> pa.substrait.run_query(pa.py_buffer(result_bytes), table_provider)
>> >>
>> >>
>> >> That being said, when I run your full example with that, we now get a
>> >> different error similar to what you get when you pass in through JSON:
>> >>
>> >> Traceback (most recent call last):
>> >>   File "<stdin>", line 1, in <module>
>> >>   File "pyarrow/_substrait.pyx", line 140, in
>> pyarrow._substrait.run_query
>> >>     c_reader = GetResultValue(c_res_reader)
>> >>   File "pyarrow/error.pxi", line 144, in
>> >> pyarrow.lib.pyarrow_internal_check_status
>> >>     return check_status(status)
>> >>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>> >>     raise ArrowInvalid(message)
>> >> pyarrow.lib.ArrowInvalid: ExecPlan has no node
>> >>
>> >>
>> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:82
>> >>  plan_->Validate()
>> >>
>> >>
>> /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:131
>> >>  executor.Execute()
>> >>
>> >>
>> >> We get the same error even if I add operations onto the plan:
>> >>
>> >> result = translate(t.group_by("a").mutate(z = t.b.sum()), compiler)
>> >> print(result)
>> >>
>> >>
>> >> project {
>> >>   input {
>> >>     read {
>> >>       base_schema {
>> >>         names: "a"
>> >>         names: "b"
>> >>         struct {
>> >>           types {
>> >>             i64 {
>> >>               nullability: NULLABILITY_NULLABLE
>> >>             }
>> >>           }
>> >>           types {
>> >>             i64 {
>> >>               nullability: NULLABILITY_NULLABLE
>> >>             }
>> >>           }
>> >>           nullability: NULLABILITY_REQUIRED
>> >>         }
>> >>       }
>> >>       named_table {
>> >>         names: "table0"
>> >>       }
>> >>     }
>> >>   }
>> >>   expressions {
>> >>     selection {
>> >>       direct_reference {
>> >>         struct_field {
>> >>         }
>> >>       }
>> >>       root_reference {
>> >>       }
>> >>     }
>> >>   }
>> >>   expressions {
>> >>     selection {
>> >>       direct_reference {
>> >>         struct_field {
>> >>           field: 1
>> >>         }
>> >>       }
>> >>       root_reference {
>> >>       }
>> >>     }
>> >>   }
>> >>   expressions {
>> >>     window_function {
>> >>       function_reference: 1
>> >>       partitions {
>> >>         selection {
>> >>           direct_reference {
>> >>             struct_field {
>> >>             }
>> >>           }
>> >>           root_reference {
>> >>           }
>> >>         }
>> >>       }
>> >>       upper_bound {
>> >>         unbounded {
>> >>         }
>> >>       }
>> >>       lower_bound {
>> >>         unbounded {
>> >>         }
>> >>       }
>> >>       phase: AGGREGATION_PHASE_INITIAL_TO_RESULT
>> >>       output_type {
>> >>         i64 {
>> >>           nullability: NULLABILITY_NULLABLE
>> >>         }
>> >>       }
>> >>       arguments {
>> >>         value {
>> >>           selection {
>> >>             direct_reference {
>> >>               struct_field {
>> >>                 field: 1
>> >>               }
>> >>             }
>> >>             root_reference {
>> >>             }
>> >>           }
>> >>         }
>> >>       }
>> >>     }
>> >>   }
>> >> }
>> >>
>> >>
>> >> Full reproduction:
>> >>
>> >> import pyarrow as pa
>> >> import pyarrow.substrait
>> >> import ibis
>> >> from ibis_substrait.compiler.core import SubstraitCompiler
>> >> from ibis_substrait.compiler.translate import translate
>> >>
>> >>
>> >> compiler = SubstraitCompiler()
>> >>
>> >>
>> >> t = ibis.table([("a", "int64"), ("b", "int64")], name="table0")
>> >> result = translate(t.group_by("a").mutate(z = t.b.sum()), compiler)
>> >>
>> >> def table_provider(names):
>> >>     if not names:
>> >>         raise Exception("No names provided")
>> >>     elif names[0] == 'table0':
>> >>         return test_table_0
>> >>     else:
>> >>         raise Exception(f"Unknown table name {names}")
>> >>
>> >>
>> >> test_table_0 = pa.Table.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>> >>
>> >> result_bytes = result.SerializeToString()
>> >>
>> >> pa.substrait.run_query(pa.py_buffer(result_bytes), table_provider)
>> >>
>> >> Best,
>> >>
>> >> Will Jones
>> >>
>> >> On Tue, Oct 4, 2022 at 12:30 PM Li Jin <ice.xell...@gmail.com> wrote:
>> >>
>> >> > For reference, this is the "relations" entry that I was referring to:
>> >> >
>> >> >
>> >>
>> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_substrait.py#L186
>> >> >
>> >> > On Tue, Oct 4, 2022 at 3:28 PM Li Jin <ice.xell...@gmail.com> wrote:
>> >> >
>> >> > > So I made some progress with updated code:
>> >> > >
>> >> > >         t = ibis.table([("a", "int64"), ("b", "int64")],
>> >> name="table0")
>> >> > >
>> >> > >         test_table_0 = pa.Table.from_pydict({"a": [1, 2, 3], "b":
>> [4,
>> >> 5,
>> >> > > 6]})
>> >> > >
>> >> > >
>> >> > >
>> >> > >         result = translate(t, self.compiler)
>> >> > >
>> >> > >
>> >> > >
>> >> > >         def table_provider(names):
>> >> > >
>> >> > >             if not names:
>> >> > >
>> >> > >                 raise Exception("No names provided")
>> >> > >
>> >> > >             elif names[0] == 'table0':
>> >> > >
>> >> > >                 return test_table_0
>> >> > >
>> >> > >             else:
>> >> > >
>> >> > >                 raise Exception(f"Unknown table name {names}")
>> >> > >
>> >> > >
>> >> > >
>> >> > >         print(result)
>> >> > >
>> >> > >         result_buf =
>> >> > > pa._substrait._parse_json_plan(tobytes(MessageToJson(result)))
>> >> > >
>> >> > >
>> >> > >
>> >> > >         pa.substrait.run_query(result_buf, table_provider)
>> >> > >
>> >> > > I think now the plan is passed properly and I got a "ArrowInvalid:
>> >> Empty
>> >> > > substrait plan is passed"
>> >> > >
>> >> > >
>> >> > > Looking the plan reproduces by ibis-substrait, it looks like
>> doesn't
>> >> > match
>> >> > > the expected format of Acero consumer. In particular, it looks like
>> >> the
>> >> > > plan produced by ibis-substrait doesn't have a "relations" entry -
>> any
>> >> > > thoughts on how this can be fixed? (I don't know if I am using the
>> API
>> >> > > wrong or some format inconsistency between the two)
>> >> > >
>> >> > > On Tue, Oct 4, 2022 at 1:54 PM Li Jin <ice.xell...@gmail.com>
>> wrote:
>> >> > >
>> >> > >> Hi,
>> >> > >>
>> >> > >> I am testing integration between ibis-substrait and Acero but hit
>> a
>> >> > >> segmentation fault. I think this might be cause the way I am
>> >> > >> integrating these two libraries are wrong, here is my code:
>> >> > >>
>> >> > >> Li Jin
>> >> > >> 1:51 PM (1 minute ago)
>> >> > >> to me
>> >> > >>
>> >> > >> class BasicTests(unittest.TestCase):
>> >> > >>
>> >> > >>     """Test basic features"""
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>     @classmethod
>> >> > >>
>> >> > >>     def setUpClass(cls):
>> >> > >>
>> >> > >>         cls.compiler = SubstraitCompiler()
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>     def test_named_table(self):
>> >> > >>
>> >> > >>         """Test basic"""
>> >> > >>
>> >> > >>         t = ibis.table([("a", "int64"), ("b", "int64")],
>> >> name="table0")
>> >> > >>
>> >> > >>         result = translate(t, self.compiler)
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>         def table_provider(names):
>> >> > >>
>> >> > >>             if not names:
>> >> > >>
>> >> > >>                 raise Exception("No names provided")
>> >> > >>
>> >> > >>             elif names[0] == 'table0':
>> >> > >>
>> >> > >>                 return test_table_0
>> >> > >>
>> >> > >>             else:
>> >> > >>
>> >> > >>                 raise Exception(f"Unknown table name {names}")
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>         test_table_0 = pa.Table.from_pydict({"a": [1, 2, 3], "b":
>> >> [4, 5,
>> >> > >> 6]})
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>         print(type(result))
>> >> > >>
>> >> > >>         print(result)
>> >> > >>
>> >> > >>         result_bytes = result.SerializeToString()
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>         pa.substrait.run_query(result_bytes, table_provider)
>> >> > >>
>> >> > >>
>> >> > >> I wonder if someone has tried integration between these two before
>> >> and
>> >> > >> can share some working code?
>> >> > >>
>> >> > >
>> >> >
>> >>
>> >
>>
>

Re: Integration between ibis-substrait and Acero

Reply via email to