Thank you Kou, Gavin, David, Antoine, and Raphael.

It sounds like there is agreement on the following:

- There is broad interest in the topic of how best to transfer Arrow data
over HTTP.
- Focusing only on REST-style APIs is too limiting; we should scope this to
be about HTTP more broadly. (This does *not* include gRPC or WebSocket.)
- The type of asset that would best serve this purpose is a conventions
document, i.e. an informal specification. We should *not* try to create a
formal specification or protocol for this, at least not yet.
- We should establish the encapsulated IPC message format as the data
payload format. (This corresponds to
"Content-Type: application/vnd.apache.arrow.stream".)
- We should specify how to send and receive both single IPC messages and
streams of IPC messages.
- We should provide some conventions for HTTP headers and for non-Arrow
metadata (although some of these might be suggestions rather than
conventions).
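
To make the last three points a bit more concrete, here is a rough sketch
(Python standard library only) of a server that streams a response body with
"Transfer-Encoding: chunked" and labels it with the Arrow stream media type,
plus a client that reads it back. The handler name, the /data path, and the
placeholder byte strings are made up for illustration. A real server would
send the bytes produced by an Arrow library's IPC stream writer, and nothing
here should be read as a settled convention.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ARROW_STREAM_TYPE = "application/vnd.apache.arrow.stream"

# Placeholder chunks (assumption): a real server would emit the bytes written
# by an Arrow IPC stream writer (schema message, record batches, end marker).
PLACEHOLDER_CHUNKS = [b"<schema message>", b"<record batch 1>", b"<record batch 2>"]


class ArrowStreamHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that streams a body of unknown total length."""

    protocol_version = "HTTP/1.1"  # chunked transfer encoding needs HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", ARROW_STREAM_TYPE)
        self.send_header("Transfer-Encoding", "chunked")
        self.send_header("Connection", "close")
        self.end_headers()
        for chunk in PLACEHOLDER_CHUNKS:
            # Each HTTP chunk is: hex length, CRLF, data, CRLF.
            self.wfile.write(f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n")
        self.wfile.write(b"0\r\n\r\n")  # zero-length chunk ends the body

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once():
    """Start the server on a free local port, GET the stream once, shut down."""
    server = HTTPServer(("127.0.0.1", 0), ArrowStreamHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/data"
        with urllib.request.urlopen(url) as response:
            # The client library reassembles the chunks transparently.
            return response.headers.get("Content-Type"), response.read()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    content_type, body = fetch_once()
    print(content_type, len(body))
```

Because the body is chunked, the server does not need to know the total
length up front, which is what would let it start sending batches before the
full result is known. On the client side, the reassembled bytes would then be
handed to an Arrow IPC stream reader.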

Based on this, I will draft a skeleton document and circulate it here for
input. In the meantime, more perspectives are appreciated.

Ian

On Mon, Nov 20, 2023 at 10:31 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> I really like the idea of leveraging the mature ecosystem support for
> IPC streams [1] to provide a set of conventions for sending and
> receiving arrow data over plain HTTP.
>
> For context, my colleagues and I have run into a number of pain
> points whilst working on FlightSQL:
>
> - The additional indirection via opaque Arrow Flight payloads somewhat
> undermines the value of using an IDL. In arrow-rs we've had to introduce
> custom abstractions [2] to work around this
> - The gRPC-imposed message size limits are tricky to accommodate, and
> require non-trivial workarounds [3]
> - The FlightData abstraction leaks a lot of IPC details into clients,
> which are fiddly to get correct. Again arrow-rs has added abstractions
> [4] to work around this
> - HTTP/2 keep-alives don't work over reverse proxies, as PING frames are
> not associated with a particular stream [5][6]
>
> I therefore think providing a set of conventions for designing protocols
> operating over plain HTTP would be very compelling, as such protocols
> wouldn't encounter these pain points. This would also open the door to
> protocols making use of HTTP/3 where supported, eliminating a number of
> the issues inherent to TCP.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> [1]:
> https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
> [2]:
> https://docs.rs/arrow-flight/latest/arrow_flight/sql/server/trait.FlightSqlService.html
> [3]:
> https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoderBuilder.html#method.with_max_flight_data_size
> [4]:
> https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoder.html
> [5]:
> https://github.com/microsoft/reverse-proxy/issues/118#issuecomment-940191553
> [6]:
> https://kubernetes.github.io/ingress-nginx/examples/grpc/#notes-on-using-responserequest-streams
>
> On 20/11/2023 14:23, David Li wrote:
> > I'm with Kou: what exactly are we trying to specify?
> >
> > - The HTTP mapping of Flight RPC?
> > - A full, locked down RPC framework like Flight RPC, but otherwise
> > unrelated?
> > - Something else?
> >
> > I'd also ask: do we need to specify anything in the first place? What is
> > stopping people from using Arrow in their REST APIs, and what kind of
> > interoperability are we trying to achieve? I would say that Flight RPC
> > effectively has no interoperability at all - each project using it has its
> > own bespoke layers on top, and the "standardized" RPC methods just hinder
> > the applications that would like more control and flexibility that Flight
> > RPC does not provide. The recent additions to the Flight RPC spec speak to
> > that: they were meant for Flight SQL, but needed to be implemented at the
> > Flight RPC layer; there is not a real abstraction layer that Flight RPC
> > really serves.
> >
> >> It could consist only of a specification for how to implement
> >> support for exchanging Arrow-formatted data in an existing REST API.
> > I would say that this is the only part that might make sense: once a
> > client has acquired an Arrow-aware endpoint, what should be the format of
> > the Arrow data it gets (whether this is just the Arrow stream format, or
> > something fancier like FlightData in Flight RPC).
> >
> > Separately, it might make sense to define how GraphQL works with Arrow,
> > or other specific, full protocols/APIs. But I'm not sure there's much room
> > for a Flight RPC equivalent for HTTP/1, if Flight RPC on its own really
> > ever made sense as a full framework/protocol in the first place.
> >
> > On Sat, Nov 18, 2023, at 14:17, Gavin Ray wrote:
> >> I know that myself and a number of folks I work with would be
> >> interested in this.
> >>
> >> gRPC is a bit of a barrier for a lot of services.
> >> Having a spec for doing Arrow over HTTP APIs would be solid.
> >>
> >> In my opinion, it doesn't necessarily need to be REST-ful.
> >> Something like JSON-RPC might fit well with the existing model for Arrow
> >> over the wire that's been implemented in things like Flight/FlightSQL.
> >>
> >> Something else I've been interested in (I think Matt Topol has done
> >> work in this area) is Arrow over GraphQL, too:
> >> GraphQL and Apache Arrow: A Match Made in Data (youtube.com)
> >> <https://www.youtube.com/watch?v=5N97TzY_tis>
> >>
> >> On Sat, Nov 18, 2023 at 1:52 PM Ian Cook <ianmc...@apache.org> wrote:
> >>
> >>> Hi Kou,
> >>>
> >>> I think it is too early to make a specific proposal. I hope to use this
> >>> discussion to collect more information about existing approaches. If
> >>> several viable approaches emerge from this discussion, then I think we
> >>> should make a document listing them, like you suggest.
> >>>
> >>> Thank you for the information about Groonga. This type of
> >>> straightforward HTTP-based approach would work in the context of a REST
> >>> API, as I understand it.
> >>>
> >>> But how is the performance? Have you measured the throughput of this
> >>> approach to see if it is comparable to using Flight SQL? Is this
> >>> approach able to saturate a fast network connection?
> >>>
> >>> And what about the case in which the server wants to begin sending
> >>> batches to the client before the total number of result batches /
> >>> records is known? Would this approach work in that case? I think so but
> >>> I am not sure.
> >>>
> >>> If this HTTP-based type of approach is sufficiently performant and it
> >>> works in a sufficient proportion of the envisioned use cases, then
> >>> perhaps the proposed spec / protocol could be based on this approach. If
> >>> so, then we could refocus this discussion on which best practices to
> >>> incorporate / recommend, such as:
> >>> - server should not return the result data in the body of a response to
> >>> a query request; instead server should return a response body that gives
> >>> URI(s) at which clients can GET the result data
> >>> - transmit result data in chunks (Transfer-Encoding: chunked), with
> >>> recommendations about chunk size
> >>> - support range requests (Accept-Ranges: bytes) to allow clients to
> >>> request result ranges (or not?)
> >>> - recommendations about compression
> >>> - recommendations about TCP receive window size
> >>> - recommendation to open multiple TCP connections on very fast networks
> >>> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> >>>
> >>> On the other hand, if the performance and functionality of this
> >>> HTTP-based type of approach is not sufficient, then we might consider
> >>> fundamentally different approaches.
> >>>
> >>> Ian
> >>>
>
