Thanks a lot for your reply, Niranda and Weston.

On Thu, Aug 25, 2022 at 1:31 AM Weston Pace <weston.p...@gmail.com> wrote:

> I don't know of any work being done to turn Acero into a distributed
> query engine.
>
> However, I would hope that Acero can be used in a distributed query
> engine, and would be a useful component.
>
> If there are features that Acero would need in this environment (e.g.
> some kind of exec node for specialized transmission or partitioned
> transmission) then please feel free to create JIRA tickets describing
> what you would like to see.
>
> On Wed, Aug 24, 2022 at 7:18 AM Niranda Perera <niranda.per...@gmail.com>
> wrote:
> >
> > Hi Jayjeet,
> >
> > AFAIU, Acero work mainly focuses on single-node multithreaded execution
> > based on morsel-driven parallelism [1].
> > In your case, there are multiple options IMO. E.g., just use 2 nodes which
> > do the filtering in parallel, and then node0 does the join (this reduces
> > communication). Better yet, you could use distributed-memory computation
> > for the hash join, which uses both nodes (this is not supported yet in
> > Arrow). There are several other compute engines that support these types
> > of execution on top of the Arrow data format (e.g., Cylon, which I'm
> > working on ATM).
> >
> > [1] https://dl.acm.org/doi/abs/10.1145/2588555.2610507
> >
> > On Wed, Aug 24, 2022 at 10:00 AM Jayjeet Chakraborty <
> > jayjeetchakrabort...@gmail.com> wrote:
> >
> > > Hi Arrow Community,
> > >
> > > With the release of Acero, we were wondering if Acero can be used in a
> > > distributed environment, as for now it looks like Acero is only intended
> > > for a local context. For example, if we have a query plan with a hash
> > > join node at the root and multiple filter and project nodes on each side
> > > of the tree, each side having a data source, how can we distribute the
> > > query plan between 3 nodes: 2 nodes containing the data sources and
> > > executing the filter and project parts of the query plan in parallel,
> > > while 1 node serves as the compute node, performing only the join
> > > operation on the results from the other 2 nodes? As per my
> > > understanding, we need some form of RPC mechanism between the ExecNodes
> > > of an ExecPlan, which would probably be integrated within the Flight
> > > framework. Is that the right way to think about it? Do you think that is
> > > something the Arrow community would be interested in, if it is not
> > > already planning for it? Thanks.
> > >
> > > Jayjeet Chakraborty
> > >
> > >
> > > --
> > > *Jayjeet Chakraborty*
> > > CS PhD student
> > > UC Santa Cruz
> > > California, USA
> > >
> >
> >
> > --
> > Niranda Perera
> > https://niranda.dev/
> > @n1r44 <https://twitter.com/N1R44>
>


-- 
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA
