Hi Ryan,

Thanks for your quick response.
I am aligned with your references and would like to discuss further to take it forward.

Thanks,
Vinay

On Fri, Oct 18, 2019 at 11:51 PM Ryan Murray <rym...@dremio.com> wrote:

> Hey Vinay,
>
> This Spark source might be of interest [1]. We had discussed the
> possibility of moving it into Arrow proper as a contrib module once it is
> more stable.
>
> It is doing something similar to what you are suggesting: talking to a
> cluster of Flight servers from Spark. It deals more with the client side
> than the server side, however. It talks to a single Flight 'coordinator'
> and uses getSchema/getFlightInfo to tell the coordinator it wants a
> particular dataset. The coordinator then returns a list of Flight tickets
> covering portions of the required dataset. A client can a) ask for the
> entire dataset from the coordinator, b) iterate serially through the
> tickets and assemble the whole dataset on the client side, or c) fetch
> tickets in parallel (as the Spark connector does).
>
> I think the server side as you described above doesn't yet exist in a
> standalone form, although the Spark connector was developed in conjunction
> with [2] as the server. That server is, however, highly dependent on the
> implementation details of the Dremio engine, as the engine takes care of
> the coordination between the Flight workers. The idea is identical to
> yours: a coordinator engine, a distributed store for engine metadata, and
> worker engines which create and serve the Arrow buffers.
>
> Would be happy to discuss further if you are interested in working on this
> stuff!
>
> Best,
> Ryan
>
> [1] https://github.com/rymurr/flight-spark-source
> [2] https://github.com/dremio-hub/dremio-flight-connector
>
> On Fri, Oct 18, 2019 at 3:05 PM Vinay Kesarwani <vnkesarw...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to establish the following architecture.
> >
> > My approach to scaling Flight horizontally:
> > 1. Launch an Apache Arrow Flight server on each node.
> > 2. Declare one node as the coordinator.
> > 3. Publish the coordinator's info to a shared service (ZooKeeper).
> > 4. Launch worker nodes, which get the coordinator's info from ZooKeeper.
> > 5. Each worker publishes its own info to ZooKeeper, to be consumed by others.
> >
> > The client connects to the coordinator:
> > 1. It calls getFlightInfo(descriptor).
> > 2. The coordinator node overrides getFlightInfo().
> > 3. getFlightInfo() internally looks up the worker info for the descriptor
> > in ZooKeeper.
> > 4. The client consumes data from each endpoint iteratively, or in parallel
> > (not sure how), via getData().
> >
> > PutData:
> > 5. The client calls putData() to put data into the Flight streams of
> > different nodes.
> > 6. It iterates through the endpoints and matches worker node IPs.
> > 7. If a worker's IP matches an endpoint, the worker puts the data into that
> > node's Flight server.
> > 8. Whenever a stream is added or updated, the worker node's info is updated
> > in ZooKeeper.
> > 9. If a worker's IP doesn't match any endpoint, we need to put the data on
> > some other worker node and publish that info to ZooKeeper.
> >
> > [In future: distributed clients and distributed endpoints, e.g. Spark
> > workers talking to an Apache Arrow Flight cluster.]
> >
> > [image: image]
> > <https://user-images.githubusercontent.com/6141965/67092386-b0012c00-f1cc-11e9-9ce2-d657001a85f7.png>
> >
> > Just wanted to discuss whether any PR is in progress for horizontal scaling
> > in Arrow Flight, or whether any design doc is under discussion.
>
> --
> Ryan Murray | Principal Consulting Engineer
> +447540852009 | rym...@dremio.com
> <https://www.dremio.com/>
> Check out our GitHub <https://www.github.com/dremio>, join our community
> site <https://community.dremio.com/> & Download Dremio
> <https://www.dremio.com/download>