Hi Ryan,

Thanks for your quick response.
I am aligned with your references and would like to discuss further to take it forward.

Thanks,
Vinay

On Fri, Oct 18, 2019 at 11:51 PM Ryan Murray <rym...@dremio.com> wrote:

> Hey Vinay,
>
> This Spark source might be of interest [1]. We had discussed the
> possibility of moving it into Arrow proper as a contrib module once it is
> more stable.
>
> It is doing something similar to what you are suggesting: talking to a
> cluster of Flight servers from Spark. It deals more with the client side
> than the server side, however. It talks to a single Flight 'coordinator'
> and uses getSchema/getFlightInfo to tell the coordinator it wants a
> particular dataset. The coordinator then returns a list of Flight tickets
> covering portions of the required dataset. A client can a) ask for the
> entire dataset from the coordinator, b) iterate serially through the
> tickets and assemble the whole dataset on the client side, or c) fetch
> tickets in parallel (as the Spark connector does).
>
> I think the server side as you described above doesn't yet exist in a
> standalone form, although the Spark connector was developed in conjunction
> with [2] as the server. That server is, however, highly dependent on the
> implementation details of the Dremio engine, as the engine takes care of
> the coordination between the Flight workers. The idea is identical to
> yours: a coordinator engine, a distributed store for engine metadata, and
> worker engines which create and serve the Arrow buffers.
>
> Would be happy to discuss further if you are interested in working on this
> stuff!
>
> Best,
> Ryan
>
> [1] https://github.com/rymurr/flight-spark-source
> [2] https://github.com/dremio-hub/dremio-flight-connector
>
> On Fri, Oct 18, 2019 at 3:05 PM Vinay Kesarwani <vnkesarw...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to establish the following architecture.
> >
> > My approach to scaling Flight horizontally:
> > 1. Launch an Apache Arrow Flight server on each node.
> > 2. Declare one node as the coordinator.
> > 3. Publish the coordinator's info to a shared service (ZooKeeper).
> > 4. Launch worker nodes, which get the coordinator's info from ZooKeeper.
> > 5. Each worker publishes its own info to ZooKeeper, to be consumed by others.
> >
> > The client connects to the coordinator:
> > 1. It calls getFlightInfo(descriptor).
> > 2. The coordinator node overrides getFlightInfo().
> > 3. getFlightInfo() internally looks up the worker info for the descriptor
> > in ZooKeeper.
> > 4. The client consumes data from each endpoint iteratively, or in parallel
> > (not sure how), via getData().
> >
> > PutData:
> > 5. The client calls putData() to put data into the Flight streams of
> > different nodes.
> > 6. It iterates through the endpoints and matches worker node IPs.
> > 7. If a worker's IP matches an endpoint, the worker puts the data into that
> > node's Flight server.
> > 8. Whenever a stream is added or updated, the worker node's info is updated
> > in ZooKeeper.
> > 9. If a worker's IP doesn't match any endpoint, we need to put the data on
> > some other worker node and publish that info to ZooKeeper.
> >
> > [In future: distributed clients and distributed endpoints, e.g. Spark
> > workers talking to an Apache Arrow Flight cluster.]
> >
> > [image: image]
> > <https://user-images.githubusercontent.com/6141965/67092386-b0012c00-f1cc-11e9-9ce2-d657001a85f7.png>
> >
> > Just wanted to discuss whether any PR is in progress for horizontal scaling
> > in Arrow Flight, or whether any design doc is under discussion.
>
> --
> Ryan Murray | Principal Consulting Engineer
> +447540852009 | rym...@dremio.com
> <https://www.dremio.com/>
> Check out our GitHub <https://www.github.com/dremio>, join our community
> site <https://community.dremio.com/> & Download Dremio
> <https://www.dremio.com/download>