I think this approach is reasonable.  I've added my more detailed
thoughts to [1]

[1] https://issues.apache.org/jira/browse/ARROW-18063

On Thu, Oct 13, 2022 at 3:03 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> After some struggling we finally managed to connect our internal data
> source to Acero and executed a data load via pyarrow.substrait.run_query() !
>
> We did end up temporarily modifying substrait/options.h source code locally
> and made kDefaultNamedTableProvider extern/global.
>
> But since this doesn't sound like the correct way, I am happy to do this
> correctly but someone let me know the correct way :)
>
> Li
>
> On Thu, Oct 13, 2022 at 2:01 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Going back to the default_exec_factory_registry idea, I think ultimately
> > maybe we want registration API that looks like:
> >
> > """
> > MetaRegistry* registry = compute::default_meta_registry();
> > registry->RegisterNamedTableProvider(...);
> > registry->exec_factory_registry()->AddFactory("my_custom_node",
> > MakeMyCustomNode)
> > ...
> > """
> >
> > On Thu, Oct 13, 2022 at 1:32 PM Li Jin <ice.xell...@gmail.com> wrote:
> >
> >> Weston - was trying the pyarrow approach you suggested:
> >>
> >> >def custom_source(endpoint):
> >>   return pc.Declaration("my_custom_source", create_my_custom_options())
> >>
> >> (1) I didn't see "Declaration" under pyarrow.compute - which package is
> >> this under?
> >> (2) What Python object should I return with  create_my_custom_options()?
> >> Currently I only have a C++ class for my custom option.
> >>
> >> On Thu, Oct 13, 2022 at 12:58 PM Li Jin <ice.xell...@gmail.com> wrote:
> >>
> >>> > I may be assuming here but I think your problem is more that there is
> >>> no way to more flexibly describe a source in python and less that you
> >>> need to change the default.
> >>>
> >>> Correct.
> >>>
> >>> > For example, if you could do something like this (in pyarrow) would it
> >>> work?
> >>> I could try to see if that works. I feel registering the extension in
> >>> C++ via one initialization seems cleaner to me because there are many 
> >>> other
> >>> extension points that we initialize (add registering in the 
> >>> default_exec_factory_registry
> >>> similar to
> >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30).
> >>> Having some of the initialization in Pyarrow and some in Acero is a bit
> >>> complicated IMO.
> >>>
> >>> Maybe the proper long term fix is to add something similar to
> >>> compute::default_exec_factory_registry for all extension points (named
> >>> table, substrait extension message) and the user can register custom stuff
> >>> similar to how they can register via default_exec_factory_registry?)
> >>>
> >>> On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com>
> >>> wrote:
> >>>
> >>>> > Does that sound like a reasonable way to do this?
> >>>>
> >>>> It's not ideal.
> >>>>
> >>>> I may be assuming here but I think your problem is more that there is
> >>>> no way to more flexibly describe a source in python and less that you
> >>>> need to change the default.
> >>>>
> >>>> For example, if you could do something like this (in pyarrow) would it
> >>>> work?
> >>>>
> >>>> ```
> >>>> def custom_source(endpoint):
> >>>>   return pc.Declaration("my_custom_source", create_my_custom_options())
> >>>>
> >>>> def table_provider(names):
> >>>>   return custom_sources[names[0]]
> >>>>
> >>>> pa.substrait.run_query(my_plan, table_provider=table_provider)
> >>>> ```
> >>>>
> >>>> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote:
> >>>> >
> >>>> > We did some work around this recently and think there needs to be some
> >>>> > small change to allow users to override this default provider. I will
> >>>> > explain in more details:
> >>>> >
> >>>> > (1) Since the variable is defined as static in the substrait/options.h
> >>>> > file, each translation unit will have a separate copy of the
> >>>> > kDefaultNamedTableProvider
> >>>> > variable. And therefore, the user cannot really change the default
> >>>> that is
> >>>> > used here:
> >>>> >
> >>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
> >>>> >
> >>>> > In order to allow user to override the kDefaultNamedTableProvider (and
> >>>> > change the behavior of
> >>>> >
> >>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
> >>>> > to use a custom NamedTableProvider), we need to
> >>>> > (1) in substrait/options.hh, change the definition of
> >>>> > kDefaultNamedTableProvider to be an extern declaration
> >>>> > (2) move the definition of kDefaultNamedTableProvider to an
> >>>> > substrait/options.cc file
> >>>> >
> >>>> > We are still testing this but based on my limited C++ knowledge, I
> >>>> > think this would allow users to do
> >>>> > """
> >>>> > #include "arrow/engine/substrait/options.h"
> >>>> >
> >>>> > void initialize() {
> >>>> >     arrow::engine::kDefaultNamedTableProvider =
> >>>> > some_custom_name_table_provider;
> >>>> > }
> >>>> > """
> >>>> > And then calling `pa.substrat.run_query" should pick up the custom
> >>>> name
> >>>> > table provider.
> >>>> >
> >>>> > Does that sound like a reasonable way to do this?
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
> >>>> >
> >>>> > > Thanks both. I think NamedTableProvider is close to what I want,
> >>>> and like
> >>>> > > Weston said, the tricky bit is how to use a custom
> >>>> NamedTableProvider when
> >>>> > > calling the pyarrow substrait API.
> >>>> > >
> >>>> > > It's a little hacky but I *think* I can override the value
> >>>> "kDefaultNamedTableProvider"
> >>>> > > here and pass "table_provider=None" then it "should" work:
> >>>> > >
> >>>> > >
> >>>> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
> >>>> > >
> >>>> > > I am going to give that a shot once I pull/build Arrow default into
> >>>> our
> >>>> > > internal build system.
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <
> >>>> bengil...@gmail.com>
> >>>> > > wrote:
> >>>> > >
> >>>> > >> It seems to me that your use case could be handled by defining a
> >>>> custom
> >>>> > >> NamedTableProvider and
> >>>> > >> assigning this to ConversionOptions::named_table_provider. This
> >>>> was added
> >>>> > >> in
> >>>> > >> https://github.com/apache/arrow/pull/13613 to provide user
> >>>> configurable
> >>>> > >> dispatching for named tables;
> >>>> > >> if it doesn't address your use case then we might want to create a
> >>>> JIRA to
> >>>> > >> extend it.
> >>>> > >>
> >>>> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com>
> >>>> wrote:
> >>>> > >>
> >>>> > >> > I did some more digging into this and have some ideas -
> >>>> > >> >
> >>>> > >> > Currently, the logic for deserialization named table is:
> >>>> > >> >
> >>>> > >> >
> >>>> > >>
> >>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
> >>>> > >> > and it will look up named tables from a user provided dictionary
> >>>> from
> >>>> > >> > string -> arrow Table.
> >>>> > >> >
> >>>> > >> > My idea is to make some short term changes to allow named tables
> >>>> to be
> >>>> > >> > dispatched differently (This logic can be reverted/removed once
> >>>> we
> >>>> > >> figure
> >>>> > >> > out the proper way to support custom data sources, perhaps via
> >>>> substrait
> >>>> > >> > Extensions.), specifically:
> >>>> > >> >
> >>>> > >> > (1) The user creates named table with uris for custom data
> >>>> source, i.e.,
> >>>> > >> > "my_datasource://tablename?begin=20200101&end=20210101"
> >>>> > >> > (2) In the substrait consumer, allowing user to register custom
> >>>> dispatch
> >>>> > >> > rules based on uri scheme (similar to how exec node registry
> >>>> works),
> >>>> > >> i.e.,
> >>>> > >> > sth like:
> >>>> > >> >
> >>>> > >> > substrait_named_table_registry.add("my_datasource",
> >>>> deser_my_datasource)
> >>>> > >> > and deser_my_datasource is a function that takes the NamedTable
> >>>> > >> substrait
> >>>> > >> > message and returns a declaration.
> >>>> > >> >
> >>>> > >> > I know doing this just for named tables might not be a very
> >>>> general
> >>>> > >> > solution but seems the easiest path forward, and we can always
> >>>> remove
> >>>> > >> this
> >>>> > >> > later in favor of a more generic solution.
> >>>> > >> >
> >>>> > >> > Thoughts?
> >>>> > >> >
> >>>> > >> > Li
> >>>> > >> >
> >>>> > >> >
> >>>> > >> >
> >>>> > >> >
> >>>> > >> >
> >>>> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com>
> >>>> wrote:
> >>>> > >> >
> >>>> > >> > > Hello!
> >>>> > >> > >
> >>>> > >> > > I am working on adding a custom data source node in Acero. I
> >>>> have a
> >>>> > >> few
> >>>> > >> > > previous threads related to this topic.
> >>>> > >> > >
> >>>> > >> > > Currently, I am able to register my custom factory method with
> >>>> Acero
> >>>> > >> and
> >>>> > >> > > create a Custom source node, i.e., I can register and execute
> >>>> this
> >>>> > >> with
> >>>> > >> > > Acero:
> >>>> > >> > >
> >>>> > >> > > MySourceNodeOptions source_options = ...
> >>>> > >> > > Declaration source{"my_source", source_option}
> >>>> > >> > >
> >>>> > >> > > The next step I want to do is to pass this through to the Acero
> >>>> > >> substrait
> >>>> > >> > > consumer. From previous discussions, I am going to use
> >>>> "NamedTable ''
> >>>> > >> as
> >>>> > >> > a
> >>>> > >> > > temporary way to define my custom data source in substrait. My
> >>>> > >> question
> >>>> > >> > is
> >>>> > >> > > this:
> >>>> > >> > >
> >>>> > >> > > What I need to do in substrait in order to register my own
> >>>> substrait
> >>>> > >> > > consumer rule/function for deserializing my custom named table
> >>>> > >> protobuf
> >>>> > >> > > message into the declaration above. If this is not supported
> >>>> right
> >>>> > >> now,
> >>>> > >> > > what is a reasonable/minimal change to make this work?
> >>>> > >> > >
> >>>> > >> > > Thanks,
> >>>> > >> > > Li
> >>>> > >> > >
> >>>> > >> >
> >>>> > >>
> >>>> > >
> >>>>
> >>>

Reply via email to