Hi Li,

It depends on how you expect everything to fit together, and in particular on 
what exactly the application is. For instance, you could have the application 
code do everything up through DoGet and get a reader, then create a SourceNode 
from the reader and continue from there.
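
As a rough illustration of that first option, here is a small Python sketch. 
It deliberately uses stand-ins rather than the real Flight or Acero APIs: 
FakeFlightClient, source_node, filter_even and run_plan are all hypothetical 
names made up for this example, and a "batch" is just a list of ints. The only 
point is the ordering: the application does GetFlightInfo and DoGet itself, 
and the plan only ever sees a reader.

```python
from typing import Iterator, List

Batch = List[int]  # stand-in for a record batch


class FakeFlightClient:
    """Hypothetical stand-in for a Flight client; not the real API."""

    def get_flight_info(self, descriptor: str) -> List[str]:
        # Pretend the server answers with one ticket for this descriptor.
        return [f"{descriptor}#0"]

    def do_get(self, ticket: str) -> Iterator[Batch]:
        # Pretend the server streams two batches for this ticket.
        yield [1, 2, 3]
        yield [4, 5]


def source_node(reader: Iterator[Batch]) -> Iterator[Batch]:
    """Hypothetical source node: forwards batches from an existing reader."""
    for batch in reader:
        yield batch


def filter_even(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Hypothetical downstream node: keeps only even values."""
    for batch in batches:
        yield [x for x in batch if x % 2 == 0]


def run_plan(descriptor: str) -> List[int]:
    client = FakeFlightClient()
    # The application code issues the RPCs up front...
    tickets = client.get_flight_info(descriptor)
    reader = client.do_get(tickets[0])
    # ...and the plan starts from the resulting reader.
    plan = filter_even(source_node(reader))
    return [value for batch in plan for value in batch]
```

Calling run_plan("my_storage://my_path") drains the stream through the 
downstream node; nothing in the plan itself knows about Flight.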

Otherwise, I think the way to go would be to create a node directly from a 
FlightDescriptor (which would contain the URL/parameters in your example). I 
think that would fit into Arrow Dataset, under ARROW-10524 [1]: I'd equate 
GetFlightInfo to dataset discovery, and each FlightEndpoint in the FlightInfo 
to a Fragment. As a bonus, there's already good integration between Dataset 
and Acero, so this should naturally do things like read the FlightEndpoints in 
parallel with readahead and so on.

That means: you'd start with the FlightDescriptor, and create a Dataset from 
it. This will call GetFlightInfo under the hood. (There's a minor catch here: 
this assumes the service that returns the FlightInfo can embed an accurate 
schema into it. If that's not true, there'll have to be some finagling with 
various ways of getting the actual schema, depending on what exactly your 
service supports.) Once you have a Dataset, you can create an ExecPlan and 
proceed like normal.
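
To make the Dataset mapping a bit more concrete, here is a toy Python model of 
it. Everything in it is a hypothetical stand-in (Endpoint, Info, FakeService, 
FlightFragment, FlightDataset); none of these are real Arrow classes, and the 
thread pool is only a crude substitute for what the Dataset/Acero scanner 
would do. It assumes the service embeds an accurate schema in its response.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, Iterator, List

Batch = List[int]  # stand-in for a record batch


@dataclass
class Endpoint:
    """Stand-in for a FlightEndpoint."""
    ticket: str


@dataclass
class Info:
    """Stand-in for a FlightInfo: a schema plus the endpoints to read."""
    schema: List[str]
    endpoints: List[Endpoint]


class FakeService:
    """Hypothetical in-memory service keyed by ticket; not the real API."""

    def __init__(self, data: Dict[str, List[Batch]]):
        self._data = data

    def get_flight_info(self, descriptor: str) -> Info:
        # Assumes the service can report an accurate schema here.
        endpoints = [Endpoint(ticket) for ticket in sorted(self._data)]
        return Info(schema=["x"], endpoints=endpoints)

    def do_get(self, ticket: str) -> Iterator[Batch]:
        return iter(self._data[ticket])


class FlightFragment:
    """One fragment per endpoint; reading it issues DoGet."""

    def __init__(self, service: FakeService, endpoint: Endpoint):
        self._service = service
        self._endpoint = endpoint

    def read(self) -> List[Batch]:
        return list(self._service.do_get(self._endpoint.ticket))


class FlightDataset:
    """Discovery is GetFlightInfo; each endpoint becomes a fragment."""

    def __init__(self, service: FakeService, descriptor: str):
        info = service.get_flight_info(descriptor)
        self.schema = info.schema
        self.fragments = [FlightFragment(service, ep) for ep in info.endpoints]

    def scan(self, max_workers: int = 4) -> List[Batch]:
        # Read fragments concurrently; map() keeps fragment order.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            per_fragment = pool.map(lambda frag: frag.read(), self.fragments)
            return [batch for batches in per_fragment for batch in batches]
```

With a FakeService holding two tickets, FlightDataset(service, descriptor) 
ends up with two fragments, and scan() returns their batches in fragment 
order.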

Of course, if you then want to get things into Python, R, Substrait, etc... 
that requires some more work - especially for Substrait where I'm not sure how 
best to encode a custom source like that.

[1]: https://issues.apache.org/jira/browse/ARROW-10524

-David

On Wed, Aug 31, 2022, at 17:09, Li Jin wrote:
> Hello!
>
> I have recently started to look into integrating Flight RPC with Acero
> source/sink node.
>
> In Flight, the life cycle of a "read" request looks something like:
>
>    - User specifies a URL (e.g. my_storage://my_path) and parameter (e.g.,
>    begin = "20220101", end = "20220201")
>    - Client issues GetFlightInfo and gets a FlightInfo from the server
>    - Client issues DoGet with the FlightInfo and gets a stream reader
>    - Client calls Next until the stream is exhausted
>
> My question is, how does the above life cycle fit in an Acero node? In
> other words, what are the proper places in Acero node lifecycle to issue
> the corresponding flight RPC?
>
> Appreciate any thoughts,
> Li
