Antoine,

Thank you for taking a look. I agree: these are basic examples intended to
prove the concept and answer fundamental questions. Next, I intend to expand
the set of examples to cover more complex cases.

> This might necessitate some kind of framing layer, or a
> standardized delimiter.

I am interested to hear more perspectives on this. My perspective is that we
should recommend using HTTP conventions to keep a clean separation between
the Arrow-formatted binary data payloads and the various application-specific
fields. This can be achieved by encoding application-specific fields in URI
paths, query parameters, headers, or separate parts of multipart/form-data
messages.

Ian

On Wed, Dec 6, 2023 at 1:24 PM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi,
>
> While this looks like a nice start, I would expect more precise
> recommendations for writing non-trivial services. Especially, one
> question is how to send both an application-specific POST request and an
> Arrow stream, or an application-specific GET response and an Arrow
> stream. This might necessitate some kind of framing layer, or a
> standardized delimiter.
>
> Regards
>
> Antoine.
>
>
> On 05/12/2023 at 21:10, Ian Cook wrote:
> > This is a continuation of the discussion entitled "[DISCUSS] Protocol for
> > exchanging Arrow data over REST APIs". See the previous messages at
> > https://lists.apache.org/thread/vfz74gv1knnhjdkro47shzd1z5g5ggnf.
> >
> > To inform this discussion, I created some basic Arrow-over-HTTP client and
> > server examples here:
> > https://github.com/apache/arrow/pull/39081
> >
> > My intention is to expand and improve this set of examples (with your help)
> > until they reflect a set of conventions that we are comfortable documenting
> > as recommendations.
> >
> > Please take a look and add comments / suggestions in the PR.
> >
> > Thanks,
> > Ian
> >
> > On Tue, Nov 21, 2023 at 1:35 PM Dewey Dunnington
> > <de...@voltrondata.com.invalid> wrote:
> >
> >> I also think a set of best practices for Arrow over HTTP would be a
> >> valuable resource for the community...even if it never becomes a
> >> specification of its own, it will be beneficial for API developers and
> >> consumers of those APIs to have a place to look to understand how
> >> Arrow can help improve throughput/latency/maybe other things. Possibly
> >> something like httpbin.org but for requests/responses that use Arrow
> >> would be helpful as well. Thank you Ian for leading this effort!
> >>
> >> It has mostly been covered already, but in the (ubiquitous) situation
> >> where a response contains some schema/table and some non-schema/table
> >> information there is some tension between throughput (best served by a
> >> JSON response plus one or more IPC stream responses) and latency (best
> >> served by a single HTTP response? JSON? IPC with metadata/header?). In
> >> addition to Antoine's list, I would add:
> >>
> >> - How to serve the same table in multiple requests (e.g., to saturate
> >>   a network connection, or because separate worker nodes are generating
> >>   results anyway).
> >> - How to inline a small schema/table into a single request with other
> >>   metadata (I have seen this done as base64-encoded IPC in JSON, but
> >>   perhaps there is a better way)
> >>
> >> If anybody is interested in experimenting, I repurposed a previous
> >> experiment I had as a flask app that can stream IPC to a client:
> >>
> >> https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
> >> .
> >>
> >>> - recommendations about compression
> >>
> >> Just a note that there is also Content-Encoding: gzip (for consumers
> >> like Arrow JS that don't currently support buffer compression but that
> >> can leverage the facilities of the browser/http library)
> >>
> >> Cheers!
> >>
> >> -dewey
> >>
> >>
> >> On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>>> But how is the performance?
> >>>
> >>> It's faster than the original JSON-based API.
> >>>
> >>> I implemented Apache Arrow support for a C# client, so I
> >>> measured only with Apache Arrow C#, but the Apache Arrow
> >>> based API is faster than the JSON-based API.
> >>>
> >>>> Have you measured the throughput of this approach to see
> >>>> if it is comparable to using Flight SQL?
> >>>
> >>> Sorry. I didn't measure the throughput. In this case, the elapsed
> >>> time of one request/response pair is more important than
> >>> throughput. It was faster than the JSON-based API and fast
> >>> enough for our needs.
> >>>
> >>> I couldn't compare to a Flight SQL based approach because
> >>> Groonga doesn't support Flight SQL yet.
> >>>
> >>>> Is this approach able to saturate a fast network
> >>>> connection?
> >>>
> >>> I think that we can't measure this with the Groonga case
> >>> because Groonga doesn't send data continuously.
> >>> Here is one of the request patterns:
> >>>
> >>> 1. Groonga has log data partitioned by day
> >>> 2. Groonga does full text search against one partition (2023-11-01)
> >>> 3. Groonga sends the result to client as Apache Arrow
> >>>    streaming format record batches
> >>> 4. Groonga does full text search against the next partition (2023-11-02)
> >>> 5. Groonga sends the result to client as Apache Arrow
> >>>    streaming format record batches
> >>> 6. ...
> >>>
> >>> In this case, result data isn't being sent continuously (search
> >>> -> send -> search -> send -> ...), so it doesn't saturate a
> >>> fast network connection.
> >>>
> >>> (Steps 3 and 4 could run in parallel, but that's not implemented yet.)
> >>>
> >>> If we optimize this approach, it may be able to
> >>> saturate a fast network connection.
> >>>
> >>>> And what about the case in which the server wants to begin sending
> >>>> batches to the client before the total number of result batches /
> >>>> records is known?
> >>>
> >>> Ah, sorry. I forgot to explain this case. Groonga uses the
> >>> above approach for it.
> >>>
> >>>> - server should not return the result data in the body of a response
> >>>>   to a query request; instead server should return a response body
> >>>>   that gives URI(s) at which clients can GET the result data
> >>>
> >>> If we want to do this, the standard "Location" HTTP header
> >>> may be suitable.
> >>>
> >>>> - transmit result data in chunks (Transfer-Encoding: chunked), with
> >>>>   recommendations about chunk size
> >>>
> >>> Ah, sorry. I forgot to explain this case too. Groonga uses
> >>> "Transfer-Encoding: chunked". But the recommended chunk size may
> >>> be case-by-case... If a server can produce data fast enough,
> >>> a larger chunk size may be faster. Otherwise, a larger chunk
> >>> size may be slower.
> >>>
> >>>> - support range requests (Accept-Ranges: bytes) to allow clients to
> >>>>   request result ranges (or not?)
> >>>
> >>> In the Groonga case, it's not supported, because Groonga
> >>> drops the result after one request/response pair. Groonga
> >>> can't return only the specified range of the result after the
> >>> response has been returned.
> >>>
> >>>> - recommendations about compression
> >>>
> >>> In the case that the network is the bottleneck, LZ4 or Zstandard
> >>> compression can improve total performance.
> >>>
> >>>> - recommendations about TCP receive window size
> >>>> - recommendation to open multiple TCP connections on very fast networks
> >>>>   (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> >>>
> >>> HTTP/3 may be better for these cases.
> >>>
> >>>
> >>> Thanks,
> >>> --
> >>> kou
> >>>
> >>> In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
> >>> "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on
> >>> Sat, 18 Nov 2023 13:51:53 -0500,
> >>> Ian Cook <ianmc...@apache.org> wrote:
> >>>
> >>>> Hi Kou,
> >>>>
> >>>> I think it is too early to make a specific proposal. I hope to use this
> >>>> discussion to collect more information about existing approaches. If
> >>>> several viable approaches emerge from this discussion, then I think we
> >>>> should make a document listing them, like you suggest.
> >>>>
> >>>> Thank you for the information about Groonga. This type of
> >>>> straightforward HTTP-based approach would work in the context of a
> >>>> REST API, as I understand it.
> >>>>
> >>>> But how is the performance? Have you measured the throughput of this
> >>>> approach to see if it is comparable to using Flight SQL? Is this
> >>>> approach able to saturate a fast network connection?
> >>>>
> >>>> And what about the case in which the server wants to begin sending
> >>>> batches to the client before the total number of result batches /
> >>>> records is known? Would this approach work in that case? I think so
> >>>> but I am not sure.
> >>>>
> >>>> If this HTTP-based type of approach is sufficiently performant and it
> >>>> works in a sufficient proportion of the envisioned use cases, then
> >>>> perhaps the proposed spec / protocol could be based on this approach.
> >>>> If so, then we could refocus this discussion on which best practices
> >>>> to incorporate / recommend, such as:
> >>>> - server should not return the result data in the body of a response
> >>>>   to a query request; instead server should return a response body
> >>>>   that gives URI(s) at which clients can GET the result data
> >>>> - transmit result data in chunks (Transfer-Encoding: chunked), with
> >>>>   recommendations about chunk size
> >>>> - support range requests (Accept-Ranges: bytes) to allow clients to
> >>>>   request result ranges (or not?)
> >>>> - recommendations about compression
> >>>> - recommendations about TCP receive window size
> >>>> - recommendation to open multiple TCP connections on very fast networks
> >>>>   (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> >>>>
> >>>> On the other hand, if the performance and functionality of this
> >>>> HTTP-based type of approach are not sufficient, then we might consider
> >>>> fundamentally different approaches.
> >>>>
> >>>> Ian
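
For illustration, here is a minimal sketch of the separation suggested at the
top of this thread: application-specific fields travel in query parameters
and headers, and the response body carries only the binary payload. It uses
only the Python standard library. The /results path, the "table" query
parameter, the X-Example-* header names, and the placeholder payload are all
assumptions made up for this sketch; they are not taken from the examples in
the PR. A real server would write an Arrow IPC stream with an Arrow library
where the placeholder bytes are used.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse

# Placeholder bytes standing in for an Arrow IPC stream (assumption: a real
# server would produce these bytes with an Arrow library).
PLACEHOLDER_PAYLOAD = b"placeholder bytes standing in for an Arrow IPC stream"

# Media type registered for the Arrow IPC streaming format.
ARROW_STREAM_MEDIA_TYPE = "application/vnd.apache.arrow.stream"


class ResultHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Application-specific request fields arrive in the query string,
        # not in the body. "table" is a hypothetical parameter name.
        query = parse_qs(urlparse(self.path).query)
        table = query.get("table", ["unknown"])[0]

        self.send_response(200)
        self.send_header("Content-Type", ARROW_STREAM_MEDIA_TYPE)
        self.send_header("Content-Length", str(len(PLACEHOLDER_PAYLOAD)))
        # Hypothetical application-specific response fields, sent as headers.
        self.send_header("X-Example-Table", table)
        self.send_header("X-Example-Row-Count", "3")
        self.end_headers()
        # The body is the binary payload and nothing else.
        self.wfile.write(PLACEHOLDER_PAYLOAD)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the example quiet


def fetch(url):
    """Return (application metadata, content type, raw payload bytes)."""
    with urllib.request.urlopen(url) as resp:
        metadata = {
            "table": resp.headers["X-Example-Table"],
            "row_count": int(resp.headers["X-Example-Row-Count"]),
        }
        return metadata, resp.headers["Content-Type"], resp.read()


def run_demo():
    # Port 0 asks the OS for any free port on the loopback interface.
    server = ThreadingHTTPServer(("127.0.0.1", 0), ResultHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return fetch(f"http://127.0.0.1:{port}/results?table=logs")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    metadata, content_type, payload = run_demo()
    print(metadata, content_type, len(payload))
```

With this layout the client can hand the body to an Arrow stream reader
unchanged, since no application-specific framing or delimiter is mixed into
the payload. Whether headers, query parameters, or multipart parts are the
better carrier for a given field is the open question in this thread.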