I also think a set of best practices for Arrow over HTTP would be a
valuable resource for the community...even if it never becomes a
specification of its own, it will be beneficial for API developers and
consumers of those APIs to have a place to look to understand how
Arrow can help improve throughput/latency/maybe other things. Possibly
something like httpbin.org but for requests/responses that use Arrow
would be helpful as well. Thank you Ian for leading this effort!
It has mostly been covered already, but in the (ubiquitous) situation
where a response contains some schema/table and some non-schema/table
information, there is some tension between throughput (best served by a
JSON response plus one or more IPC stream responses) and latency (best
served by a single HTTP response? JSON? IPC with metadata in the
header?). In addition to Antoine's list, I would add:
- How to serve the same table in multiple requests (e.g., to saturate
a network connection, or because separate worker nodes are generating
results anyway).
- How to inline a small schema/table into a single request with other
metadata (I have seen this done as base64-encoded IPC in JSON, but
perhaps there is a better way).
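For what it's worth, a minimal sketch of the base64-in-JSON approach mentioned above. The IPC bytes here are a hypothetical placeholder; a real server would produce them with an Arrow IPC stream writer:

```python
import base64
import json

# Hypothetical serialized Arrow IPC stream; in practice this would come
# from an Arrow IPC stream writer (e.g. pyarrow's RecordBatchStreamWriter).
ipc_bytes = b"ARROW-IPC-STREAM-PLACEHOLDER"

# Embed the IPC payload alongside other response metadata in one JSON body.
envelope = {
    "status": "ok",
    "row_count": 3,
    "table": base64.b64encode(ipc_bytes).decode("ascii"),
}
body = json.dumps(envelope)

# The client decodes the field back into raw IPC bytes before parsing.
decoded = base64.b64decode(json.loads(body)["table"])
assert decoded == ipc_bytes
```

The ~33% size overhead of base64 is the main cost here, which is why this only seems attractive for small tables.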
If anybody is interested in experimenting, I repurposed a previous
experiment of mine as a Flask app that can stream IPC to a client:
https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
- recommendations about compression
Just a note that there is also Content-Encoding: gzip (for consumers
like Arrow JS that don't currently support buffer compression but that
can leverage the facilities of the browser/HTTP library).
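A rough sketch of what that looks like, treating the IPC stream as opaque bytes (the payload is a placeholder, and a real server would typically let the HTTP stack apply the encoding rather than doing it by hand):

```python
import gzip

# Placeholder for a serialized Arrow IPC stream (opaque bytes here).
ipc_stream = b"\xff\xff\xff\xff" + b"0123456789" * 100

# Server side: compress the whole response body and advertise it via the
# standard header so any HTTP client (including browsers) can undo it.
compressed = gzip.compress(ipc_stream)
headers = {
    "Content-Type": "application/vnd.apache.arrow.stream",
    "Content-Encoding": "gzip",
}

# Client side: browsers and most HTTP libraries decompress transparently
# based on the header; done by hand here for illustration.
assert gzip.decompress(compressed) == ipc_stream
```

The appeal is that the consumer never needs buffer-level compression support; the decompression happens below the Arrow layer entirely.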
Cheers!
-dewey
On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
Hi,
But how is the performance?
It's faster than the original JSON based API.
I implemented Apache Arrow support for a C# client, so I measured only
with Apache Arrow C#, but the Apache Arrow based API is faster than the
JSON based API.
Have you measured the throughput of this approach to see
if it is comparable to using Flight SQL?
Sorry, I didn't measure the throughput. In this case, the elapsed time
of one request/response pair is more important than throughput. And it
was faster than the JSON based API, with sufficient performance.
I couldn't compare to a Flight SQL based approach because
Groonga doesn't support Flight SQL yet.
Is this approach able to saturate a fast network
connection?
I think that we can't measure this with the Groonga case because
Groonga doesn't send data continuously. Here is one of the request
patterns:
1. Groonga has log data partitioned by day
2. Groonga does full text search against one partition (2023-11-01)
3. Groonga sends the result to the client as Apache Arrow
streaming format record batches
4. Groonga does full text search against the next partition (2023-11-02)
5. Groonga sends the result to the client as Apache Arrow
streaming format record batches
6. ...
In this case, the result data aren't being sent continuously (search
-> send -> search -> send -> ...), so it doesn't saturate a fast
network connection.
(3. and 4. could run in parallel, but that's not implemented yet.)
If we optimize this approach, it may be able to saturate a fast
network connection.
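The search -> send -> search -> send pattern above might be sketched like this (the partition names, batch contents, and function names are all hypothetical stand-ins):

```python
# Hypothetical partitioned log store: one entry per day, as described above.
partitions = {
    "2023-11-01": [b"batch-a", b"batch-b"],
    "2023-11-02": [b"batch-c"],
}

def full_text_search(partition):
    # Stand-in for the actual search; returns record batches as raw bytes.
    return partitions[partition]

def stream_results(days):
    # search -> send -> search -> send ...: all batches for one partition
    # are yielded before the next partition is searched, so the connection
    # sits idle while each search runs.
    for day in days:
        for batch in full_text_search(day):
            yield batch

chunks = list(stream_results(["2023-11-01", "2023-11-02"]))
assert chunks == [b"batch-a", b"batch-b", b"batch-c"]
```

Overlapping the next search with the current send (the parallelism mentioned above) would keep the connection busier.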
And what about the case in which the server wants to begin sending
batches to the client before the total number of result batches /
records is known?
Ah, sorry. I forgot to explain that case. Groonga uses the
above approach for it.
- server should not return the result data in the body of a response
to a query request; instead the server should return a response body
that gives URI(s) at which clients can GET the result data
If we want to do this, the standard "Location" HTTP header
may be suitable.
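A toy sketch of that redirect pattern, with hypothetical handler functions and result paths (a real server would also need result expiry, authentication, and so on):

```python
# Hypothetical flow: the query endpoint stores the result and answers
# with a Location header instead of the data itself.
results = {}

def handle_query(query_id, result_bytes):
    results[query_id] = result_bytes
    # 303 See Other tells the client to GET the result at the given URI.
    return 303, {"Location": f"/results/{query_id}"}, b""

def handle_get(path):
    # Serve the previously stored result bytes for this query.
    query_id = path.rsplit("/", 1)[-1]
    headers = {"Content-Type": "application/vnd.apache.arrow.stream"}
    return 200, headers, results[query_id]

status, headers, _ = handle_query("q1", b"ipc-bytes")
assert status == 303 and headers["Location"] == "/results/q1"
status, _, body = handle_get(headers["Location"])
assert status == 200 and body == b"ipc-bytes"
```

One benefit of this split is that the result URIs can point at multiple workers or be fetched in parallel.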
- transmit result data in chunks (Transfer-Encoding: chunked), with
recommendations about chunk size
Ah, sorry. I forgot to explain this case too. Groonga uses
"Transfer-Encoding: chunked". But the recommended chunk size may be
case-by-case... If a server can produce data fast enough, a larger
chunk size may be faster; otherwise, a larger chunk size may be
slower.
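For reference, a minimal sketch of HTTP/1.1 chunked framing, where the chunk size is exactly the tuning knob being discussed (in practice the HTTP stack does this framing for you):

```python
def encode_chunked(stream, chunk_size=8):
    # HTTP/1.1 chunked framing: hex length, CRLF, data, CRLF, repeated,
    # then a final zero-length chunk. No total length is needed up front,
    # which is what allows streaming results of unknown size.
    out = b""
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        out += f"{len(chunk):x}\r\n".encode() + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

framed = encode_chunked(b"0123456789abcdef", chunk_size=8)
assert framed == b"8\r\n01234567\r\n8\r\n89abcdef\r\n0\r\n\r\n"
```

Smaller chunks add framing and syscall overhead per byte; larger chunks can add latency if the producer is slow to fill them, which matches the case-by-case trade-off above.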
- support range requests (Accept-Ranges: bytes) to allow clients to
request result ranges (or not?)
In the Groonga case, this isn't supported, because Groonga drops the
result after one request/response pair. Groonga can't return only the
specified range of the result after the response has been returned.
- recommendations about compression
In the case that the network is the bottleneck, LZ4 or Zstandard
compression will improve total performance.
- recommendations about TCP receive window size
- recommendation to open multiple TCP connections on very fast networks
(e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
HTTP/3 may be better for these cases.
Thanks,
--
kou
In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
"Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on
Sat, 18 Nov 2023 13:51:53 -0500,
Ian Cook <ianmc...@apache.org> wrote:
Hi Kou,
I think it is too early to make a specific proposal. I hope to use this
discussion to collect more information about existing approaches. If
several viable approaches emerge from this discussion, then I think we
should make a document listing them, like you suggest.
Thank you for the information about Groonga. This type of
straightforward HTTP-based approach would work in the context of a
REST API, as I understand it.
But how is the performance? Have you measured the throughput of this
approach to see if it is comparable to using Flight SQL? Is this
approach able to saturate a fast network connection?
And what about the case in which the server wants to begin sending
batches to the client before the total number of result batches /
records is known? Would this approach work in that case? I think so
but I am not sure.
If this HTTP-based type of approach is sufficiently performant and it
works in a sufficient proportion of the envisioned use cases, then
perhaps the proposed spec / protocol could be based on this approach.
If so, then we could refocus this discussion on which best practices
to incorporate / recommend, such as:
- server should not return the result data in the body of a response
to a query request; instead the server should return a response body
that gives URI(s) at which clients can GET the result data
- transmit result data in chunks (Transfer-Encoding: chunked), with
recommendations about chunk size
- support range requests (Accept-Ranges: bytes) to allow clients to
request result ranges (or not?)
- recommendations about compression
- recommendations about TCP receive window size
- recommendation to open multiple TCP connections on very fast networks
(e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
On the other hand, if the performance and functionality of this
HTTP-based type of approach is not sufficient, then we might consider
fundamentally different approaches.
Ian