Thanks a lot for your reply, Niranda and Weston. On Thu, Aug 25, 2022 at 1:31 AM Weston Pace <weston.p...@gmail.com> wrote:
> I don't know of any work being done to turn Acero into a distributed > query engine. > > However, I would hope that Acero can be used in a distributed query > engine, and would be a useful component. > > If there are features that Acero would need in this environment (e.g. > some kind of exec node for specialized transmission or partitioned > transmission) then please feel free to create JIRA tickets describing > what you would like to see. > > On Wed, Aug 24, 2022 at 7:18 AM Niranda Perera <niranda.per...@gmail.com> > wrote: > > > > Hi Jayeet, > > > > AFAIU, Acero work mainly focuses on single node multithreaded execution > > based on morsel driven parallelism [1]. > > In your case, there are multiple options IMO. Ex. just use 2 nodes which > do > > filtering parallely, and then node0 does the join (this reduces > > communication). Better yet, if you could use distributed memory > > computation for the hash join which uses both nodes (which is not > supported > > yet in arrow). There are several other compute engines that support these > > types of execution on top of arrow dataformat (eg: Cylon which I'm > working > > on ATM) > > > > [1] https://dl.acm.org/doi/abs/10.1145/2588555.2610507 > > > > On Wed, Aug 24, 2022 at 10:00 AM Jayjeet Chakraborty < > > jayjeetchakrabort...@gmail.com> wrote: > > > > > Hi Arrow Community, > > > > > > With the release of Acero, we were wondering if Acero can be used in a > > > distributed environment as for now it looks like Acero is only > intended for > > > a local context. For example, if we have a query plan with a hash join > node > > > at the root and multiple filter project nodes on each sides of the > tree, > > > each side having a data source, how can we distribute the query plan > > > between 3 nodes: 2 nodes containing data sources and executing the > filter > > > and project parts of the query plan in parallel while 1 node serving > as the > > > compute node, performing only the join operation on the results from > the > > > other 2 nodes. As per my understanding, we need some form of RPC > mechanism > > > between the ExecNodes of an ExecPlan and would probably be integrated > > > within the Flight framework. Is that the right way to think about it ? > Do > > > you think that is something the Arrow community would be interested in > if > > > not already planning for it ? Thanks. > > > > > > Jayjeet Chakraborty > > > > > > > > > -- > > > *Jayjeet Chakraborty* > > > CS PhD student > > > UC Santa Cruz > > > California, USA > > > > > > > > > -- > > Niranda Perera > > https://niranda.dev/ > > @n1r44 <https://twitter.com/N1R44> > -- *Jayjeet Chakraborty* CS PhD student UC Santa Cruz California, USA