Hey Joris,

Your plan sounds right for Flight. As for semantics:

The descriptor and ticket formats are mostly application-defined. For instance, I think some places (Dremio?) just put a raw SQL query as the "cmd" of a descriptor; putting serialized JSON or Protobuf is also certainly fine. I'd say implementing _every_ endpoint isn't required - we don't use ListFlights, for instance.

In terms of what you described, I'd map a descriptor to a query, and a Flight to its execution; calling GetFlightInfo would return each worker in its own FlightEndpoint, and the Ticket would be something agreed upon by your coordinator and workers (e.g. the request and time range). I've put a rough Python sketch of this at the bottom of the mail.

For docs, have you seen this? https://arrow.apache.org/docs/format/Flight.html
While it's labeled "Format", it contains an example of a Flight request flow.

Best,
David

On 6/23/20, Joris Peeters <joris.mg.peet...@gmail.com> wrote:
> Hello,
>
> I'm interested in using Flight for serving large amounts of data in a
> parallelised manner, and am just building some Python prototypes, based on
> https://github.com/apache/arrow/blob/apache-arrow-0.17.1/python/examples/flight
>
> In my use-case, we'd have a bunch of worker servers, serving a number of
> different datasets (here called "datasetA" and "datasetB"), but also some
> additional parameters to customise a single query (e.g. a date range if the
> dataset is a time series, but it can be other stuff too, depending on the
> dataset).
>
> The idea is for clients to hit a single coordinator with their entire query
> (e.g. datasetA + [1970, 2020]), and then be instructed to hit a variety of
> workers with slices of this, e.g. {worker1: (datasetA, [1970, 1990)),
> worker2: (datasetA, [1990, 2020])}. I.e. I want to chunk up the original
> request into a few smaller ones, to be handled by different workers, which
> then retrieve the data from a DB and send it back to the client, which
> aggregates.
>
> Although I'm prototyping from Python, this should work from a variety of
> platforms.
> Does that sound like something Flight should be able to do well?
>
> If so - what are the intended semantics for the descriptor, ticket etc.,
> based on my previous example? I see idioms for "path" and "cmd" etc., but
> neither really seems to fit. My query is more like some opaque JSON, e.g.
> something you'd submit to an HTTP server. Is the idea to send a
> string-serialisation of e.g.:
>
> {
>   "dataset": "datasetA",
>   "dateFrom": "1970-01-01",
>   "dateTo": "2020-06-23"
> }?
>
> In that case, what should ListFlights return, given that the queries are
> dynamic? Something like
> ["datasetA", "datasetB", ...]?
>
> I guess I'm mainly struggling to understand what a descriptor, ticket and
> flight really are, within my context - and can't really find it in the
> docs. Just a link to some good docs would obviously be great as well! I'm
> hitting https://arrow.apache.org/docs/python/api/flight.html which is
> largely empty. It does say "Flight is currently not distributed as part of
> wheels or in Conda - it is only available when built from source
> appropriately.", which seems a bit pessimistic, as it appears to be present
> in both the PyPI and conda 0.17.1 packages I checked.
>
> Cheers,
> -Joris.
>
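
Here's the rough, untested sketch I mentioned, using pyarrow.flight, just to make the
mapping concrete. The worker list, split_query, fetch_from_db and the schema are all
placeholders; how you split the query and what goes into the ticket is entirely up to
your coordinator and workers.

    import json
    import pyarrow as pa
    import pyarrow.flight as flight

    # Placeholder worker registry and query splitter (application-defined).
    WORKERS = ["grpc://worker1:8815", "grpc://worker2:8815"]

    def split_query(query):
        # Purely illustrative: split the date range in two, one slice per worker.
        return [(WORKERS[0], {**query, "dateTo": "1990-01-01"}),
                (WORKERS[1], {**query, "dateFrom": "1990-01-01"})]

    def fetch_from_db(request):
        # Stand-in for the real database read on the worker.
        return pa.table({"ts": pa.array([], pa.timestamp("ms")),
                         "value": pa.array([], pa.float64())})

    class Coordinator(flight.FlightServerBase):
        def get_flight_info(self, context, descriptor):
            # The descriptor's "cmd" carries the opaque query; here it's JSON.
            query = json.loads(descriptor.command)
            endpoints = []
            for location, sub_query in split_query(query):
                # The ticket is whatever the coordinator and workers agree on.
                ticket = flight.Ticket(json.dumps(sub_query).encode())
                endpoints.append(flight.FlightEndpoint(ticket, [location]))
            # Placeholder schema; in practice, the schema of the result set.
            schema = pa.schema([("ts", pa.timestamp("ms")), ("value", pa.float64())])
            return flight.FlightInfo(schema, descriptor, endpoints, -1, -1)

    class Worker(flight.FlightServerBase):
        def do_get(self, context, ticket):
            # The worker decodes the agreed-upon ticket and streams the result back.
            request = json.loads(ticket.ticket)
            return flight.RecordBatchStream(fetch_from_db(request))

    def run_query(query):
        # Client: one GetFlightInfo against the coordinator, then one DoGet per endpoint.
        coordinator = flight.connect("grpc://coordinator:8815")
        descriptor = flight.FlightDescriptor.for_command(json.dumps(query).encode())
        info = coordinator.get_flight_info(descriptor)
        tables = []
        for endpoint in info.endpoints:
            worker = flight.connect(endpoint.locations[0])
            tables.append(worker.do_get(endpoint.ticket).read_all())
        return pa.concat_tables(tables)

The DoGet calls against the workers are independent, so the client can also fetch the
endpoints in parallel and aggregate the tables at the end.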
The descriptor and ticket format are mostly application defined. For instance, I think some places (Dremio?) just put a raw SQL query as the "cmd" of a descriptor; putting serialized JSON or Protobuf is also certainly fine. I'd say implementing _every_ endpoint isn't required - we don't use ListFlights for instance. In terms of what you described, I'd map a descriptor to a query, and a Flight to its execution; calling GetFlightInfo would return each worker in its own FlightEndpoint, and the Ticket would be something agreed upon by your coordinator and worker (e.g. the request and time range). For docs, have you seen this? https://arrow.apache.org/docs/format/Flight.html While it's labeled "Format", it contains an example of a Flight request flow. Best, David On 6/23/20, Joris Peeters <joris.mg.peet...@gmail.com> wrote: > Hello, > > I'm interested in using Flight for serving large amounts of data in a > parallelised manner, and just building some Python prototypes, based on > https://github.com/apache/arrow/blob/apache-arrow-0.17.1/python/examples/flight > > In my use-case, we'd have a bunch of worker servers, serving a number of > different datasets (here called "datasetA" and "datasetB"), but also some > additional parameters to customise a single query (eg a date range if the > dataset is a time series, but can be other stuff too - depending on the > dataset). > > The idea is for clients to hit a single coordinator with their entire query > (eg datasetA + [1970,2020]), and then getting instructed to hit a variety > of workers, with slices of this, e.g. {worker1: (datasetA, [1970, 1990)), > worker2: (datasetA, [1990-2020])}. I.e. I want to chunk up the original > request in a few smaller ones, to be handled by different workers, which > then retrieve the data from a DB and send it back to the client, which > aggregates. > > Although I'm proto-typing from Python, this should work from a variety of > platforms. > Does that sound like something Flight should be able to do well? > > If so - what are the intended semantics for the descriptor and ticket etc, > based on my previous example? I see idioms for "path" and "cmd" etc, but > neither really seems to fit. My query is more like some opaque JSON, e.g. > something you'd submit to an HTTP server. Is the idea to send a > string-serialisation of e.g: > > { > "dataset": "datasetA", > "dateFrom": "1970-01-01", > "dateTo": "2020-06-23" > }? > > In that case, what should listFlights return, given that the queries are > dynamic? Something like, > ["datasetA", "datasetB", ...] ? > > I guess I'm mainly struggling to understand what a descriptor, ticket and > flight really are, within my context - and can't really find it in the > docs. > Just a link to some good docs would obviously be great as well! I'm hitting > https://arrow.apache.org/docs/python/api/flight.html which is largely > empty. It does say "Flight is currently not distributed as part of wheels > or in Conda - it is only available when built from source appropriately." > which seems a bit pessimistic, as it appears present in both the pypi and > conda 0.17.1 package I checked. > > Cheers, > -Joris. >