I also think a set of best practices for Arrow over HTTP would be a
valuable resource for the community...even if it never becomes a
specification of its own, it will be beneficial for API developers and
consumers of those APIs to have a place to look to understand how
Arrow can help improve throughput/latency/maybe other things. Possibly
something like httpbin.org but for requests/responses that use Arrow
would be helpful as well. Thank you Ian for leading this effort!
It has mostly been covered already, but in the (ubiquitous) situation
where a response contains some schema/table and some non-schema/table
information, there is some tension between throughput (best served by a
JSON response plus one or more IPC stream responses) and latency (best
served by a single HTTP response? JSON? IPC with metadata in the
header?). In addition to Antoine's list, I would add:
- How to serve the same table in multiple requests (e.g., to saturate
a network connection, or because separate worker nodes are generating
results anyway).
- How to inline a small schema/table into a single request with other
metadata (I have seen this done as base64-encoded IPC in JSON, but
perhaps there is a better way).
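For what it's worth, a minimal sketch of the base64-in-JSON approach mentioned above. The IPC bytes here are a hypothetical placeholder; a real server would produce them with an Arrow IPC stream writer:

```python
import base64
import json

# Hypothetical serialized Arrow IPC stream; in practice this would come
# from an Arrow IPC stream writer (e.g. pyarrow's RecordBatchStreamWriter).
ipc_bytes = b"ARROW-IPC-STREAM-PLACEHOLDER"

# Embed the IPC payload alongside other response metadata in one JSON body.
envelope = {
    "status": "ok",
    "row_count": 3,
    "table": base64.b64encode(ipc_bytes).decode("ascii"),
}
body = json.dumps(envelope)

# The client decodes the field back into raw IPC bytes before parsing.
decoded = base64.b64decode(json.loads(body)["table"])
assert decoded == ipc_bytes
```

The ~33% size overhead of base64 is the main cost here, which is why this only seems attractive for small tables.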
If anybody is interested in experimenting, I repurposed a previous
experiment of mine as a Flask app that can stream IPC to a client:
https://github.com/paleolimbot/2023-11-21_arrow-over-http-scratchpad/pull/1/files
- recommendations about compression
Just a note that there is also Content-Encoding: gzip (for consumers
like Arrow JS that don't currently support buffer compression but that
can leverage the facilities of the browser/HTTP library).
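A rough sketch of what that looks like, treating the IPC stream as opaque bytes (the payload is a placeholder, and a real server would typically let the HTTP stack apply the encoding rather than doing it by hand):

```python
import gzip

# Placeholder for a serialized Arrow IPC stream (opaque bytes here).
ipc_stream = b"\xff\xff\xff\xff" + b"0123456789" * 100

# Server side: compress the whole response body and advertise it via the
# standard header so any HTTP client (including browsers) can undo it.
compressed = gzip.compress(ipc_stream)
headers = {
    "Content-Type": "application/vnd.apache.arrow.stream",
    "Content-Encoding": "gzip",
}

# Client side: browsers and most HTTP libraries decompress transparently
# based on the header; done by hand here for illustration.
assert gzip.decompress(compressed) == ipc_stream
```

The appeal is that the consumer never needs buffer-level compression support; the decompression happens below the Arrow layer entirely.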
Cheers!
-dewey
On Mon, Nov 20, 2023 at 8:30 PM Sutou Kouhei <k...@clear-code.com> wrote:
Hi,
But how is the performance?
It's faster than the original JSON based API.
I implemented Apache Arrow support for a C# client, so I measured only
with Apache Arrow C#, but the Apache Arrow based API is faster than the
JSON based API.
Have you measured the throughput of this approach to see
if it is comparable to using Flight SQL?
Sorry, I didn't measure the throughput. In this case, the elapsed time
of one request/response pair is more important than throughput. And it
was faster than the JSON based API, with sufficient performance.
I couldn't compare to a Flight SQL based approach because
Groonga doesn't support Flight SQL yet.
Is this approach able to saturate a fast network
connection?
I think that we can't measure this with the Groonga case because
Groonga doesn't send data continuously. Here is one of the request
patterns:
1. Groonga has log data partitioned by day
2. Groonga does full text search against one partition (2023-11-01)
3. Groonga sends the result to the client as Apache Arrow
streaming format record batches
4. Groonga does full text search against the next partition (2023-11-02)
5. Groonga sends the result to the client as Apache Arrow
streaming format record batches
6. ...
In this case, the result data aren't being sent continuously (search
-> send -> search -> send -> ...), so it doesn't saturate a fast
network connection.
(3. and 4. could run in parallel, but that's not implemented yet.)
If we optimize this approach, it may be able to saturate a fast
network connection.
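The search -> send -> search -> send pattern above might be sketched like this (the partition names, batch contents, and function names are all hypothetical stand-ins):

```python
# Hypothetical partitioned log store: one entry per day, as described above.
partitions = {
    "2023-11-01": [b"batch-a", b"batch-b"],
    "2023-11-02": [b"batch-c"],
}

def full_text_search(partition):
    # Stand-in for the actual search; returns record batches as raw bytes.
    return partitions[partition]

def stream_results(days):
    # search -> send -> search -> send ...: all batches for one partition
    # are yielded before the next partition is searched, so the connection
    # sits idle while each search runs.
    for day in days:
        for batch in full_text_search(day):
            yield batch

chunks = list(stream_results(["2023-11-01", "2023-11-02"]))
assert chunks == [b"batch-a", b"batch-b", b"batch-c"]
```

Overlapping the next search with the current send (the parallelism mentioned above) would keep the connection busier.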
And what about the case in which the server wants to begin sending
batches to the client before the total number of result batches /
records is known?
Ah, sorry. I forgot to explain that case. Groonga uses the
above approach for it.
- server should not return the result data in the body of a response
to a query request; instead the server should return a response body
that gives URI(s) at which clients can GET the result data
If we want to do this, the standard "Location" HTTP header
may be suitable.
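A toy sketch of that redirect pattern, with hypothetical handler functions and result paths (a real server would also need result expiry, authentication, and so on):

```python
# Hypothetical flow: the query endpoint stores the result and answers
# with a Location header instead of the data itself.
results = {}

def handle_query(query_id, result_bytes):
    results[query_id] = result_bytes
    # 303 See Other tells the client to GET the result at the given URI.
    return 303, {"Location": f"/results/{query_id}"}, b""

def handle_get(path):
    # Serve the previously stored result bytes for this query.
    query_id = path.rsplit("/", 1)[-1]
    headers = {"Content-Type": "application/vnd.apache.arrow.stream"}
    return 200, headers, results[query_id]

status, headers, _ = handle_query("q1", b"ipc-bytes")
assert status == 303 and headers["Location"] == "/results/q1"
status, _, body = handle_get(headers["Location"])
assert status == 200 and body == b"ipc-bytes"
```

One benefit of this split is that the result URIs can point at multiple workers or be fetched in parallel.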
- transmit result data in chunks (Transfer-Encoding: chunked), with
recommendations about chunk size
Ah, sorry. I forgot to explain this case too. Groonga uses
"Transfer-Encoding: chunked". But the recommended chunk size may be
case-by-case... If a server can produce data fast enough, a larger
chunk size may be faster; otherwise, a larger chunk size may be
slower.
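For reference, a minimal sketch of HTTP/1.1 chunked framing, where the chunk size is exactly the tuning knob being discussed (in practice the HTTP stack does this framing for you):

```python
def encode_chunked(stream, chunk_size=8):
    # HTTP/1.1 chunked framing: hex length, CRLF, data, CRLF, repeated,
    # then a final zero-length chunk. No total length is needed up front,
    # which is what allows streaming results of unknown size.
    out = b""
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        out += f"{len(chunk):x}\r\n".encode() + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

framed = encode_chunked(b"0123456789abcdef", chunk_size=8)
assert framed == b"8\r\n01234567\r\n8\r\n89abcdef\r\n0\r\n\r\n"
```

Smaller chunks add framing and syscall overhead per byte; larger chunks can add latency if the producer is slow to fill them, which matches the case-by-case trade-off above.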
- support range requests (Accept-Ranges: bytes) to allow clients to
request result ranges (or not?)
In the Groonga case, this isn't supported, because Groonga drops the
result after one request/response pair. Groonga can't return only the
specified range of the result after the response has been returned.
- recommendations about compression
In the case that the network is the bottleneck, LZ4 or Zstandard
compression will improve total performance.
- recommendations about TCP receive window size
- recommendation to open multiple TCP connections on very fast networks
(e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
HTTP/3 may be better for these cases.
Thanks,
--
kou
In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
"Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on
Sat, 18 Nov 2023 13:51:53 -0500,
Ian Cook <ianmc...@apache.org> wrote:
Hi Kou,
I think it is too early to make a specific proposal. I hope to use this
discussion to collect more information about existing approaches. If
several viable approaches emerge from this discussion, then I think we
should make a document listing them, like you suggest.
Thank you for the information about Groonga. This type of
straightforward HTTP-based approach would work in the context of a
REST API, as I understand it.
But how is the performance? Have you measured the throughput of this
approach to see if it is comparable to using Flight SQL? Is this
approach able to saturate a fast network connection?
And what about the case in which the server wants to begin sending
batches to the client before the total number of result batches /
records is known? Would this approach work in that case? I think so
but I am not sure.
If this HTTP-based type of approach is sufficiently performant and it
works in a sufficient proportion of the envisioned use cases, then
perhaps the proposed spec / protocol could be based on this approach.
If so, then we could refocus this discussion on which best practices
to incorporate / recommend, such as:
- server should not return the result data in the body of a response
to a query request; instead the server should return a response body
that gives URI(s) at which clients can GET the result data
- transmit result data in chunks (Transfer-Encoding: chunked), with
recommendations about chunk size
- support range requests (Accept-Ranges: bytes) to allow clients to
request result ranges (or not?)
- recommendations about compression
- recommendations about TCP receive window size
- recommendation to open multiple TCP connections on very fast networks
(e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
On the other hand, if the performance and functionality of this
HTTP-based type of approach is not sufficient, then we might consider
fundamentally different approaches.
Ian