I think this approach is reasonable. I've added my more detailed thoughts to [1]
[1] https://issues.apache.org/jira/browse/ARROW-18063 On Thu, Oct 13, 2022 at 3:03 PM Li Jin <ice.xell...@gmail.com> wrote: > > After some struggling we finally managed to connect our internal data > source to Acero and executed a data load via pyarrow.substrait.run_query() ! > > We did end up temporarily modifying substrait/options.h source code locally > and made kDefaultNamedTableProvider extern/global. > > But since this doesn't sound like the correct way, I am happy to do this > correctly but someone let me know the correct way :) > > Li > > On Thu, Oct 13, 2022 at 2:01 PM Li Jin <ice.xell...@gmail.com> wrote: > > > Going back to the default_exec_factory_registry idea, I think ultimately > > maybe we want registration API that looks like: > > > > """ > > MetaRegistry* registry = compute::default_meta_registry(); > > registry->RegisterNamedTableProvider(...); > > registry->exec_factory_registry()->AddFactory("my_custom_node", > > MakeMyCustomNode) > > ... > > """ > > > > On Thu, Oct 13, 2022 at 1:32 PM Li Jin <ice.xell...@gmail.com> wrote: > > > >> Weston - was trying the pyarrow approach you suggested: > >> > >> >def custom_source(endpoint): > >> return pc.Declaration("my_custom_source", create_my_custom_options()) > >> > >> (1) I didn't see "Declaration" under pyarrow.compute - which package is > >> this under? > >> (2) What Python object should I return with create_my_custom_options()? > >> Currently I only have a C++ class for my custom option. > >> > >> On Thu, Oct 13, 2022 at 12:58 PM Li Jin <ice.xell...@gmail.com> wrote: > >> > >>> > I may be assuming here but I think your problem is more that there is > >>> no way to more flexibly describe a source in python and less that you > >>> need to change the default. > >>> > >>> Correct. > >>> > >>> > For example, if you could do something like this (in pyarrow) would it > >>> work? > >>> I could try to see if that works. I feel registering the extension in > >>> C++ via one initialization seems cleaner to me because there are many > >>> other > >>> extension points that we initialize (add registering in the > >>> default_exec_factory_registry > >>> similar to > >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30). > >>> Having some of the initialization in Pyarrow and some in Acero is a bit > >>> complicated IMO. > >>> > >>> Maybe the proper long term fix is to add something similar to > >>> compute::default_exec_factory_registry for all extension points (named > >>> table, substrait extension message) and the user can register custom stuff > >>> similar to how they can register via default_exec_factory_registry?) > >>> > >>> On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com> > >>> wrote: > >>> > >>>> > Does that sound like a reasonable way to do this? > >>>> > >>>> It's not ideal. > >>>> > >>>> I may be assuming here but I think your problem is more that there is > >>>> no way to more flexibly describe a source in python and less that you > >>>> need to change the default. > >>>> > >>>> For example, if you could do something like this (in pyarrow) would it > >>>> work? > >>>> > >>>> ``` > >>>> def custom_source(endpoint): > >>>> return pc.Declaration("my_custom_source", create_my_custom_options()) > >>>> > >>>> def table_provider(names): > >>>> return custom_sources[names[0]] > >>>> > >>>> pa.substrait.run_query(my_plan, table_provider=table_provider) > >>>> ``` > >>>> > >>>> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote: > >>>> > > >>>> > We did some work around this recently and think there needs to be some > >>>> > small change to allow users to override this default provider. I will > >>>> > explain in more details: > >>>> > > >>>> > (1) Since the variable is defined as static in the substrait/options.h > >>>> > file, each translation unit will have a separate copy of the > >>>> > kDefaultNamedTableProvider > >>>> > variable. And therefore, the user cannot really change the default > >>>> that is > >>>> > used here: > >>>> > > >>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 > >>>> > > >>>> > In order to allow user to override the kDefaultNamedTableProvider (and > >>>> > change the behavior of > >>>> > > >>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 > >>>> > to use a custom NamedTableProvider), we need to > >>>> > (1) in substrait/options.hh, change the definition of > >>>> > kDefaultNamedTableProvider to be an extern declaration > >>>> > (2) move the definition of kDefaultNamedTableProvider to an > >>>> > substrait/options.cc file > >>>> > > >>>> > We are still testing this but based on my limited C++ knowledge, I > >>>> > think this would allow users to do > >>>> > """ > >>>> > #include "arrow/engine/substrait/options.h" > >>>> > > >>>> > void initialize() { > >>>> > arrow::engine::kDefaultNamedTableProvider = > >>>> > some_custom_name_table_provider; > >>>> > } > >>>> > """ > >>>> > And then calling `pa.substrat.run_query" should pick up the custom > >>>> name > >>>> > table provider. > >>>> > > >>>> > Does that sound like a reasonable way to do this? > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote: > >>>> > > >>>> > > Thanks both. I think NamedTableProvider is close to what I want, > >>>> and like > >>>> > > Weston said, the tricky bit is how to use a custom > >>>> NamedTableProvider when > >>>> > > calling the pyarrow substrait API. > >>>> > > > >>>> > > It's a little hacky but I *think* I can override the value > >>>> "kDefaultNamedTableProvider" > >>>> > > here and pass "table_provider=None" then it "should" work: > >>>> > > > >>>> > > > >>>> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66 > >>>> > > > >>>> > > I am going to give that a shot once I pull/build Arrow default into > >>>> our > >>>> > > internal build system. > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman < > >>>> bengil...@gmail.com> > >>>> > > wrote: > >>>> > > > >>>> > >> It seems to me that your use case could be handled by defining a > >>>> custom > >>>> > >> NamedTableProvider and > >>>> > >> assigning this to ConversionOptions::named_table_provider. This > >>>> was added > >>>> > >> in > >>>> > >> https://github.com/apache/arrow/pull/13613 to provide user > >>>> configurable > >>>> > >> dispatching for named tables; > >>>> > >> if it doesn't address your use case then we might want to create a > >>>> JIRA to > >>>> > >> extend it. > >>>> > >> > >>>> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> > >>>> wrote: > >>>> > >> > >>>> > >> > I did some more digging into this and have some ideas - > >>>> > >> > > >>>> > >> > Currently, the logic for deserialization named table is: > >>>> > >> > > >>>> > >> > > >>>> > >> > >>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129 > >>>> > >> > and it will look up named tables from a user provided dictionary > >>>> from > >>>> > >> > string -> arrow Table. > >>>> > >> > > >>>> > >> > My idea is to make some short term changes to allow named tables > >>>> to be > >>>> > >> > dispatched differently (This logic can be reverted/removed once > >>>> we > >>>> > >> figure > >>>> > >> > out the proper way to support custom data sources, perhaps via > >>>> substrait > >>>> > >> > Extensions.), specifically: > >>>> > >> > > >>>> > >> > (1) The user creates named table with uris for custom data > >>>> source, i.e., > >>>> > >> > "my_datasource://tablename?begin=20200101&end=20210101" > >>>> > >> > (2) In the substrait consumer, allowing user to register custom > >>>> dispatch > >>>> > >> > rules based on uri scheme (similar to how exec node registry > >>>> works), > >>>> > >> i.e., > >>>> > >> > sth like: > >>>> > >> > > >>>> > >> > substrait_named_table_registry.add("my_datasource", > >>>> deser_my_datasource) > >>>> > >> > and deser_my_datasource is a function that takes the NamedTable > >>>> > >> substrait > >>>> > >> > message and returns a declaration. > >>>> > >> > > >>>> > >> > I know doing this just for named tables might not be a very > >>>> general > >>>> > >> > solution but seems the easiest path forward, and we can always > >>>> remove > >>>> > >> this > >>>> > >> > later in favor of a more generic solution. > >>>> > >> > > >>>> > >> > Thoughts? > >>>> > >> > > >>>> > >> > Li > >>>> > >> > > >>>> > >> > > >>>> > >> > > >>>> > >> > > >>>> > >> > > >>>> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> > >>>> wrote: > >>>> > >> > > >>>> > >> > > Hello! > >>>> > >> > > > >>>> > >> > > I am working on adding a custom data source node in Acero. I > >>>> have a > >>>> > >> few > >>>> > >> > > previous threads related to this topic. > >>>> > >> > > > >>>> > >> > > Currently, I am able to register my custom factory method with > >>>> Acero > >>>> > >> and > >>>> > >> > > create a Custom source node, i.e., I can register and execute > >>>> this > >>>> > >> with > >>>> > >> > > Acero: > >>>> > >> > > > >>>> > >> > > MySourceNodeOptions source_options = ... > >>>> > >> > > Declaration source{"my_source", source_option} > >>>> > >> > > > >>>> > >> > > The next step I want to do is to pass this through to the Acero > >>>> > >> substrait > >>>> > >> > > consumer. From previous discussions, I am going to use > >>>> "NamedTable '' > >>>> > >> as > >>>> > >> > a > >>>> > >> > > temporary way to define my custom data source in substrait. My > >>>> > >> question > >>>> > >> > is > >>>> > >> > > this: > >>>> > >> > > > >>>> > >> > > What I need to do in substrait in order to register my own > >>>> substrait > >>>> > >> > > consumer rule/function for deserializing my custom named table > >>>> > >> protobuf > >>>> > >> > > message into the declaration above. If this is not supported > >>>> right > >>>> > >> now, > >>>> > >> > > what is a reasonable/minimal change to make this work? > >>>> > >> > > > >>>> > >> > > Thanks, > >>>> > >> > > Li > >>>> > >> > > > >>>> > >> > > >>>> > >> > >>>> > > > >>>> > >>>