Antoine,

Thank you for taking a look. I agree—these are basic examples intended
to prove the concept and answer fundamental questions. Next I intend
to expand the set of examples to cover more complex cases.

> This might necessitate some kind of framing layer, or a
> standardized delimiter.

I am interested to hear more perspectives on this. My view is that
we should recommend using standard HTTP conventions to keep a clean
separation between the Arrow-formatted binary data payloads and the
various application-specific fields. This can be achieved by encoding
application-specific fields in URI paths, query parameters, headers,
or separate parts of multipart/form-data messages.
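To illustrate the multipart option concretely, here is a minimal sketch of
framing a JSON metadata part and an Arrow IPC stream part in one
multipart/form-data body, using only the Python standard library. The field
names ("metadata", "data"), the fixed boundary string, and the placeholder
IPC bytes are all hypothetical; a real client would produce the Arrow
payload with pyarrow's IPC writer and use a randomly generated boundary.

```python
# Sketch: separate application-specific fields from the Arrow binary
# payload by carrying them as distinct parts of one multipart/form-data
# message, so no custom framing layer or delimiter is needed.
import json
from email.parser import BytesParser
from email.policy import default

# Fixed boundary for illustration only; real code must generate a random
# boundary that cannot collide with the binary payload.
BOUNDARY = "arrow-over-http-example"


def build_multipart(metadata: dict, arrow_ipc: bytes) -> bytes:
    """Frame JSON metadata and an Arrow IPC stream as two multipart parts."""
    parts = [
        (b'Content-Disposition: form-data; name="metadata"\r\n'
         b"Content-Type: application/json\r\n\r\n"
         + json.dumps(metadata).encode()),
        (b'Content-Disposition: form-data; name="data"\r\n'
         b"Content-Type: application/vnd.apache.arrow.stream\r\n\r\n"
         + arrow_ipc),
    ]
    body = b""
    for part in parts:
        body += b"--" + BOUNDARY.encode() + b"\r\n" + part + b"\r\n"
    return body + b"--" + BOUNDARY.encode() + b"--\r\n"


def parse_multipart(body: bytes) -> dict:
    """Recover the parts by field name using the stdlib email parser."""
    # Prepend a top-level Content-Type header so the MIME parser can
    # split the body on the boundary.
    headed = (b"Content-Type: multipart/form-data; boundary=" +
              BOUNDARY.encode() + b"\r\n\r\n" + body)
    msg = BytesParser(policy=default).parsebytes(headed)
    return {p.get_param("name", header="content-disposition"):
            p.get_payload(decode=True) for p in msg.iter_parts()}
```

The point of the sketch is that standard multipart machinery already gives
clean separation between the two payloads; servers and clients in any
language can rely on their existing HTTP/MIME libraries instead of a
protocol-specific framing layer.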

Ian

On Wed, Dec 6, 2023 at 1:24 PM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hi,
>
> While this looks like a nice start, I would expect more precise
> recommendations for writing non-trivial services. Especially, one
> question is how to send both an application-specific POST request and an
> Arrow stream, or an application-specific GET response and an Arrow
> stream. This might necessitate some kind of framing layer, or a
> standardized delimiter.
>
> Regards
>
> Antoine.
>
>
>
> Le 05/12/2023 à 21:10, Ian Cook a écrit :
> > This is a continuation of the discussion entitled "[DISCUSS] Protocol for
> > exchanging Arrow data over REST APIs". See the previous messages at
> > https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
> >
> > To inform this discussion, I created some basic Arrow-over-HTTP client and
> > server examples here:
> > https://github.com/apache/arrow/pull/39081
> >
> > My intention is to expand and improve this set of examples (with your help)
> > until they reflect a set of conventions that we are comfortable documenting
> > as recommendations.
> >
> > Please take a look and add comments / suggestions in the PR.
> >
> > Thanks,
> > Ian
> >
> > On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington
> > <de...@voltrondata.com.invalid> wrote:
> >
> >> I also think a set of best practices for Arrow over HTTP would be a
> >> valuable resource for the community...even if it never becomes a
> >> specification of its own, it will be beneficial for API developers and
> >> consumers of those APIs to have a place to look to understand how
> >> Arrow can help improve throughput/latency/maybe other things. Possibly
> >> something like httpbin.org but for requests/responses that use Arrow
> >> would be helpful as well. Thank you Ian for leading this effort!
> >>
> >> It has mostly been covered already, but in the (ubiquitous) situation
> >> where a response contains some schema/table and some non-schema/table
> >> information there is some tension between throughput (best served by a
> >> JSON response plus one or more IPC stream responses) and latency (best
> >> served by a single HTTP response? JSON? IPC with metadata/header?). In
> >> addition to Antoine's list, I would add:
> >>
> >> - How to serve the same table in multiple requests (e.g., to saturate
> >> a network connection, or because separate worker nodes are generating
> >> results anyway).
> >> - How to inline a small schema/table into a single request with other
> >> metadata (I have seen this done as base64-encoded IPC in JSON, but
> >> perhaps there is a better way)
> >>
> >> If anybody is interested in experimenting, I repurposed a previous
> >> experiment I had as a flask app that can stream IPC to a client:
> >>
> >> https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
> >>
> >>> - recommendations about compression
> >>
> >> Just a note that there is also Content-Encoding: gzip (for consumers
> >> like Arrow JS that don't currently support buffer compression but that
> >> can leverage the facilities of the browser/http library)
> >>
> >> Cheers!
> >>
> >> -dewey
> >>
> >>
> >> On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>>> But how is the performance?
> >>>
> >>> It's faster than the original JSON-based API.
> >>>
> >>> I implemented Apache Arrow support for a C# client, so I
> >>> measured only with the C# client, but the Apache Arrow-based
> >>> API is faster than the JSON-based API.
> >>>
> >>>> Have you measured the throughput of this approach to see
> >>>> if it is comparable to using Flight SQL?
> >>>
> >>> Sorry, I didn't measure the throughput. In this case, the
> >>> elapsed time of one request/response pair is more important
> >>> than throughput. It was faster than the JSON-based API and
> >>> the performance was sufficient.
> >>>
> >>> I couldn't compare to a Flight SQL based approach because
> >>> Groonga doesn't support Flight SQL yet.
> >>>
> >>>> Is this approach able to saturate a fast network
> >>>> connection?
> >>>
> >>> I don't think we can measure this with the Groonga case,
> >>> because Groonga doesn't send data continuously. Here is one
> >>> of the request patterns:
> >>>
> >>> 1. Groonga has log data partitioned by day
> >>> 2. Groonga does full text search against one partition (2023-11-01)
> >>> 3. Groonga sends the result to client as Apache Arrow
> >>>     streaming format record batches
> >>> 4. Groonga does full text search against the next partition (2023-11-02)
> >>> 5. Groonga sends the result to client as Apache Arrow
> >>>     streaming format record batches
> >>> 6. ...
> >>>
> >>> In this case, the result data aren't being sent continuously
> >>> (search -> send -> search -> send -> ...), so it doesn't
> >>> saturate a fast network connection.
> >>>
> >>> (3. and 4. can be parallel but it's not implemented yet.)
> >>>
> >>> If we optimize this approach, it may be able to saturate a
> >>> fast network connection.
> >>>
> >>>> And what about the case in which the server wants to begin sending
> >> batches
> >>>> to the client before the total number of result batches / records is
> >> known?
> >>>
> >>> Ah, sorry. I forgot to explain the case. Groonga uses the
> >>> above approach for it.
> >>>
> >>>> - server should not return the result data in the body of a response
> >> to a
> >>>> query request; instead server should return a response body that gives
> >>>> URI(s) at which clients can GET the result data
> >>>
> >>> If we want to do this, the standard "Location" HTTP header
> >>> may be suitable.
> >>>
> >>>> - transmit result data in chunks (Transfer-Encoding: chunked), with
> >>>> recommendations about chunk size
> >>>
> >>> Ah, sorry. I forgot to explain this case too. Groonga uses
> >>> "Transfer-Encoding: chunked". But the recommended chunk size
> >>> may be case-by-case... If a server can produce data fast
> >>> enough, a larger chunk size may be faster; otherwise, a
> >>> larger chunk size may be slower.
> >>>
> >>>> - support range requests (Accept-Ranges: bytes) to allow clients to
> >> request
> >>>> result ranges (or not?)
> >>>
> >>> In the Groonga case, it's not supported, because Groonga
> >>> drops the result after one request/response pair. Groonga
> >>> can't return only the specified range of the result after
> >>> the response is returned.
> >>>
> >>>> - recommendations about compression
> >>>
> >>> In the case that the network is the bottleneck, LZ4 or
> >>> Zstandard compression will improve total performance.
> >>>
> >>>> - recommendations about TCP receive window size
> >>>> - recommendation to open multiple TCP connections on very fast networks
> >>>> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> >>>
> >>> HTTP/3 may be better for these cases.
> >>>
> >>>
> >>> Thanks,
> >>> --
> >>> kou
> >>>
> >>> In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
> >>>    "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on
> >> Sat, 18 Nov 2023 13:51:53 -0500,
> >>>    Ian Cook <ianmc...@apache.org> wrote:
> >>>
> >>>> Hi Kou,
> >>>>
> >>>> I think it is too early to make a specific proposal. I hope to use this
> >>>> discussion to collect more information about existing approaches. If
> >>>> several viable approaches emerge from this discussion, then I think we
> >>>> should make a document listing them, like you suggest.
> >>>>
> >>>> Thank you for the information about Groonga. This type of
> >> straightforward
> >>>> HTTP-based approach would work in the context of a REST API, as I
> >>>> understand it.
> >>>>
> >>>> But how is the performance? Have you measured the throughput of this
> >>>> approach to see if it is comparable to using Flight SQL? Is this
> >> approach
> >>>> able to saturate a fast network connection?
> >>>>
> >>>> And what about the case in which the server wants to begin sending
> >> batches
> >>>> to the client before the total number of result batches / records is
> >> known?
> >>>> Would this approach work in that case? I think so but I am not sure.
> >>>>
> >>>> If this HTTP-based type of approach is sufficiently performant and it
> >> works
> >>>> in a sufficient proportion of the envisioned use cases, then perhaps
> >> the
> >>>> proposed spec / protocol could be based on this approach. If so, then
> >> we
> >>>> could refocus this discussion on which best practices to incorporate /
> >>>> recommend, such as:
> >>>> - server should not return the result data in the body of a response
> >> to a
> >>>> query request; instead server should return a response body that gives
> >>>> URI(s) at which clients can GET the result data
> >>>> - transmit result data in chunks (Transfer-Encoding: chunked), with
> >>>> recommendations about chunk size
> >>>> - support range requests (Accept-Ranges: bytes) to allow clients to
> >> request
> >>>> result ranges (or not?)
> >>>> - recommendations about compression
> >>>> - recommendations about TCP receive window size
> >>>> - recommendation to open multiple TCP connections on very fast networks
> >>>> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> >>>>
> >>>> On the other hand, if the performance and functionality of this
> >> HTTP-based
> >>>> type of approach is not sufficient, then we might consider
> >> fundamentally
> >>>> different approaches.
> >>>>
> >>>> Ian
> >>
> >
