Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

Sutou Kouhei Mon, 20 Nov 2023 16:30:37 -0800

Hi,

> But how is the performance?

It's faster than the original JSON based API.

I implemented Apache Arrow support only for a C# client, so
I measured only with Apache Arrow C#. Even so, the Apache
Arrow based API is faster than the JSON based API.

> Have you measured the throughput of this approach to see
> if it is comparable to using Flight SQL?

Sorry. I didn't measure the throughput. In this case, the
elapsed time of one request/response pair is more important
than throughput. It was faster than the JSON based API, and
the performance was sufficient.

I couldn't compare to a Flight SQL based approach because
Groonga doesn't support Flight SQL yet.

> Is this approach able to saturate a fast network
> connection?

I think that we can't measure this with the Groonga case
because Groonga doesn't send data continuously in this
case. Here is one of the request patterns:

1. Groonga has log data partitioned by day
2. Groonga does full text search against one partition (2023-11-01)
3. Groonga sends the result to client as Apache Arrow
   streaming format record batches
4. Groonga does full text search against the next partition (2023-11-02)
5. Groonga sends the result to client as Apache Arrow
   streaming format record batches
6. ...

In this case, result data isn't being sent all the time
(search -> send -> search -> send -> ...), so it doesn't
saturate a fast network connection.

(3. and 4. can run in parallel, but that's not implemented yet.)

If we optimize this approach, it may be able to saturate a
fast network connection.
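
As a rough illustration of the request pattern above, here is a
minimal Python sketch. It is not Groonga's implementation: the
names serial_stream, overlapped_stream and the search callback
are hypothetical, and the search results are plain bytes standing
in for Apache Arrow record batches. The first function is the
search -> send -> search -> send loop. The second shows one
possible way to overlap searching the next partition with
sending the previous result.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator


def serial_stream(partitions: Iterable[str],
                  search: Callable[[str], bytes]) -> Iterator[bytes]:
    """search -> send -> search -> send: no search runs while sending."""
    for partition in partitions:
        # The caller sends each yielded result before the next search starts.
        yield search(partition)


def overlapped_stream(partitions: Iterable[str],
                      search: Callable[[str], bytes],
                      depth: int = 1) -> Iterator[bytes]:
    """Search upcoming partitions in a worker thread while the caller sends."""
    done = object()  # sentinel that marks the end of the results
    results: queue.Queue = queue.Queue(maxsize=depth)

    def worker() -> None:
        try:
            for partition in partitions:
                # Blocks once `depth` results are waiting to be sent.
                results.put(search(partition))
        finally:
            # Always unblock the consumer, even if a search fails.
            # (A real server would also report the error.)
            results.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = results.get()
        if item is done:
            return
        yield item
```

Both generators yield the same results in the same order. The
second one only helps when searching and sending each take a
noticeable amount of time.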

> And what about the case in which the server wants to begin sending batches
> to the client before the total number of result batches / records is known?

Ah, sorry. I forgot to explain this case. Groonga uses the
above approach for it.

> - server should not return the result data in the body of a response to a
> query request; instead server should return a response body that gives
> URI(s) at which clients can GET the result data

If we want to do this, the standard "Location" HTTP header
may be suitable.

> - transmit result data in chunks (Transfer-Encoding: chunked), with
> recommendations about chunk size

Ah, sorry. I forgot to explain this case too. Groonga uses
"Transfer-Encoding: chunked". But the recommended chunk size
may be case-by-case... If a server can produce data fast
enough, a larger chunk size may be faster. Otherwise, a
larger chunk size may be slower.
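
For reference, here is a minimal Python sketch of how a body is
framed with "Transfer-Encoding: chunked" in HTTP/1.1: each chunk
is its size in hexadecimal, CRLF, the data, CRLF, and a
zero-size chunk ends the body. The function names rechunk and
encode_chunked and the chunk_size parameter are hypothetical,
and HTTP servers and libraries normally do this framing
themselves. It only shows where the chunk size enters.

```python
from typing import Iterable, Iterator


def rechunk(payload: bytes, chunk_size: int) -> Iterator[bytes]:
    """Split a payload into pieces of at most chunk_size bytes."""
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]


def encode_chunked(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each piece as an HTTP/1.1 chunk and terminate the body."""
    for chunk in chunks:
        if chunk:  # skip empty pieces: a zero-size chunk would end the body
            yield b"%x\r\n%s\r\n" % (len(chunk), chunk)
    yield b"0\r\n\r\n"  # last chunk with no trailer fields
```

With a smaller chunk_size, the same payload is split into more
chunks, so more framing bytes and more writes are needed. With a
larger chunk_size, the server has to collect more data before it
can send each chunk.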

> - support range requests (Accept-Range: bytes) to allow clients to request
> result ranges (or not?)

In the Groonga case, it's not supported because Groonga
drops the result after one request/response pair. Groonga
can't return only a specified range of the result after the
response has been returned.

> - recommendations about compression

If the network is the bottleneck, LZ4 or Zstandard
compression will improve total performance.

> - recommendations about TCP receive window size
> - recommendation to open multiple TCP connections on very fast networks
> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck

HTTP/3 may be better for these cases.

Thanks,
-- 
kou

In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
  "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on Sat, 18 Nov 2023 13:51:53 -0500,
  Ian Cook <ianmc...@apache.org> wrote:

> Hi Kou,
> 
> I think it is too early to make a specific proposal. I hope to use this
> discussion to collect more information about existing approaches. If
> several viable approaches emerge from this discussion, then I think we
> should make a document listing them, like you suggest.
> 
> Thank you for the information about Groonga. This type of straightforward
> HTTP-based approach would work in the context of a REST API, as I
> understand it.
> 
> But how is the performance? Have you measured the throughput of this
> approach to see if it is comparable to using Flight SQL? Is this approach
> able to saturate a fast network connection?
> 
> And what about the case in which the server wants to begin sending batches
> to the client before the total number of result batches / records is known?
> Would this approach work in that case? I think so but I am not sure.
> 
> If this HTTP-based type of approach is sufficiently performant and it works
> in a sufficient proportion of the envisioned use cases, then perhaps the
> proposed spec / protocol could be based on this approach. If so, then we
> could refocus this discussion on which best practices to incorporate /
> recommend, such as:
> - server should not return the result data in the body of a response to a
> query request; instead server should return a response body that gives
> URI(s) at which clients can GET the result data
> - transmit result data in chunks (Transfer-Encoding: chunked), with
> recommendations about chunk size
> - support range requests (Accept-Range: bytes) to allow clients to request
> result ranges (or not?)
> - recommendations about compression
> - recommendations about TCP receive window size
> - recommendation to open multiple TCP connections on very fast networks
> (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
> 
> On the other hand, if the performance and functionality of this HTTP-based
> type of approach is not sufficient, then we might consider fundamentally
> different approaches.
> 
> Ian
