Thank you Kou, Gavin, David, Antoine, and Raphael. It sounds like there is agreement on the following:
- There is broad interest in the topic of how best to transfer Arrow data
  over HTTP.
- Focusing only on REST-style APIs is too limiting; we should scope this to
  be about HTTP more broadly. (This does *not* include gRPC or WebSocket.)
- The type of asset that would best serve this purpose is a conventions
  document, i.e. an informal specification. We should *not* try to create a
  formal specification or protocol for this, at least not yet.
- We should establish the encapsulated IPC message format as the data
  payload format. (This corresponds to
  "Content-Type: application/vnd.apache.arrow.stream".)
- We should specify how to send and receive both single IPC messages and
  streams of IPC messages.
- We should provide some conventions for HTTP headers and for non-Arrow
  metadata (although some of these might be suggestions more than
  conventions).

Based on this, I will draft a skeleton document and circulate it here for
input. In the meantime, more perspectives are appreciated.

Ian

On Mon, Nov 20, 2023 at 10:31 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> I really like the idea of leveraging the mature ecosystem support for
> IPC streams [1] to provide a set of conventions for sending and
> receiving Arrow data over plain HTTP.
>
> For context, my colleagues and I have run into a number of pain
> points whilst working on FlightSQL:
>
> - The additional indirection via opaque Arrow Flight payloads somewhat
>   undermines the value of using an IDL. In arrow-rs we've had to introduce
>   custom abstractions [2] to work around this
> - The gRPC-imposed message size limits are tricky to accommodate, and
>   require non-trivial workarounds [3]
> - The FlightData abstraction leaks a lot of IPC details into clients,
>   which are fiddly to get correct.
>   Again arrow-rs has added abstractions [4] to work around this
> - HTTP/2 keep-alives don't work over reverse proxies, as PING frames are
>   not associated with a particular stream [5][6]
>
> I therefore think providing a set of conventions for designing protocols
> operating over plain HTTP would be very compelling, as such protocols
> wouldn't encounter these pain points. This would also open the door to
> protocols making use of HTTP/3 where supported, eliminating a number of
> the issues inherent to TCP.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> [1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
> [2]: https://docs.rs/arrow-flight/latest/arrow_flight/sql/server/trait.FlightSqlService.html
> [3]: https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoderBuilder.html#method.with_max_flight_data_size
> [4]: https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoder.html
> [5]: https://github.com/microsoft/reverse-proxy/issues/118#issuecomment-940191553
> [6]: https://kubernetes.github.io/ingress-nginx/examples/grpc/#notes-on-using-responserequest-streams
>
> On 20/11/2023 14:23, David Li wrote:
> > I'm with Kou: what exactly are we trying to specify?
> >
> > - The HTTP mapping of Flight RPC?
> > - A full, locked-down RPC framework like Flight RPC, but otherwise
> >   unrelated?
> > - Something else?
> >
> > I'd also ask: do we need to specify anything in the first place? What is
> > stopping people from using Arrow in their REST APIs, and what kind of
> > interoperability are we trying to achieve? I would say that Flight RPC
> > effectively has no interoperability at all - each project using it has
> > its own bespoke layers on top, and the "standardized" RPC methods just
> > hinder the applications that would like more control and flexibility
> > that Flight RPC does not provide.
> > The recent additions to the Flight RPC spec speak to that: they were
> > meant for Flight SQL, but needed to be implemented at the Flight RPC
> > layer; there is not a real abstraction layer that Flight RPC really
> > serves.
> >
> >> It could consist only of a specification for how to implement
> >> support for exchanging Arrow-formatted data in an existing REST API.
> >
> > I would say that this is the only part that might make sense: once a
> > client has acquired an Arrow-aware endpoint, what should be the format
> > of the Arrow data it gets (whether this is just the Arrow stream format,
> > or something fancier like FlightData in Flight RPC).
> >
> > Separately, it might make sense to define how GraphQL works with Arrow,
> > or other specific, full protocols/APIs. But I'm not sure there's much
> > room for a Flight RPC equivalent for HTTP/1, if Flight RPC on its own
> > really ever made sense as a full framework/protocol in the first place.
> >
> > On Sat, Nov 18, 2023, at 14:17, Gavin Ray wrote:
> >> I know that I and a number of folks I work with would be interested in
> >> this.
> >>
> >> gRPC is a bit of a barrier for a lot of services.
> >> Having a spec for doing Arrow over HTTP APIs would be solid.
> >>
> >> In my opinion, it doesn't necessarily need to be REST-ful.
> >> Something like JSON-RPC might fit well with the existing model for Arrow
> >> over the wire that's been implemented in things like Flight/FlightSQL.
> >>
> >> Something else I've been interested in (I think Matt Topol has done
> >> work in this area) is Arrow over GraphQL, too:
> >> GraphQL and Apache Arrow: A Match Made in Data (youtube.com)
> >> <https://www.youtube.com/watch?v=5N97TzY_tis>
> >>
> >> On Sat, Nov 18, 2023 at 1:52 PM Ian Cook <ianmc...@apache.org> wrote:
> >>
> >>> Hi Kou,
> >>>
> >>> I think it is too early to make a specific proposal. I hope to use this
> >>> discussion to collect more information about existing approaches.
> >>> If several viable approaches emerge from this discussion, then I think
> >>> we should make a document listing them, like you suggest.
> >>>
> >>> Thank you for the information about Groonga. This type of
> >>> straightforward HTTP-based approach would work in the context of a
> >>> REST API, as I understand it.
> >>>
> >>> But how is the performance? Have you measured the throughput of this
> >>> approach to see if it is comparable to using Flight SQL? Is this
> >>> approach able to saturate a fast network connection?
> >>>
> >>> And what about the case in which the server wants to begin sending
> >>> batches to the client before the total number of result batches /
> >>> records is known? Would this approach work in that case? I think so
> >>> but I am not sure.
> >>>
> >>> If this HTTP-based type of approach is sufficiently performant and it
> >>> works in a sufficient proportion of the envisioned use cases, then
> >>> perhaps the proposed spec / protocol could be based on this approach.
> >>> If so, then we could refocus this discussion on which best practices
> >>> to incorporate / recommend, such as:
> >>> - server should not return the result data in the body of a response
> >>>   to a query request; instead server should return a response body
> >>>   that gives URI(s) at which clients can GET the result data
> >>> - transmit result data in chunks (Transfer-Encoding: chunked), with
> >>>   recommendations about chunk size
> >>> - support range requests (Accept-Ranges: bytes) to allow clients to
> >>>   request result ranges (or not?)
> >>> - recommendations about compression
> >>> - recommendations about TCP receive window size
> >>> - recommendation to open multiple TCP connections on very fast networks
> >>>   (e.g.
> >>>   >25 Gbps) where a CPU thread could be the throughput bottleneck
> >>>
> >>> On the other hand, if the performance and functionality of this
> >>> HTTP-based type of approach are not sufficient, then we might consider
> >>> fundamentally different approaches.
> >>>
> >>> Ian
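
To make the chunked-transfer suggestion in the quoted list of best practices concrete, here is a minimal sketch using only the Python standard library. It frames a sequence of byte buffers as an HTTP/1.1 chunked body labelled with "Content-Type: application/vnd.apache.arrow.stream", which is how a server could start sending data before the total size is known. The function names (response_head, encode_chunked, decode_chunked) are hypothetical, and the payload bytes are opaque placeholders: in practice they would come from an Arrow IPC stream writer. This is an illustration under those assumptions, not a proposed convention.

```python
from typing import Iterable, Iterator

ARROW_STREAM_MEDIA_TYPE = "application/vnd.apache.arrow.stream"


def response_head() -> bytes:
    """Status line and headers for a chunked response carrying an IPC stream."""
    lines = [
        "HTTP/1.1 200 OK",
        f"Content-Type: {ARROW_STREAM_MEDIA_TYPE}",
        "Transfer-Encoding: chunked",
        "",  # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def encode_chunked(parts: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each non-empty buffer as one HTTP/1.1 chunk, then end the body."""
    for part in parts:
        if part:  # a zero-length chunk would signal end-of-body too early
            yield f"{len(part):X}\r\n".encode("ascii") + part + b"\r\n"
    yield b"0\r\n\r\n"  # last chunk, no trailer fields


def decode_chunked(body: bytes) -> bytes:
    """Reassemble the payload from a complete chunked body (no trailers)."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        # Ignore any chunk extensions after ';' in the size line.
        size = int(body[pos:eol].split(b";")[0], 16)
        if size == 0:
            return bytes(out)
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


if __name__ == "__main__":
    # Placeholder buffers standing in for encapsulated IPC messages.
    placeholder_messages = [b"schema-bytes", b"batch-1-bytes", b"batch-2-bytes"]
    wire = response_head() + b"".join(encode_chunked(placeholder_messages))
    print(len(wire), "bytes on the wire")
```

Chunk boundaries here simply follow buffer boundaries; whether chunks should align with IPC messages, and what chunk size to recommend, are exactly the open questions raised in the thread.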