Hi,

> But how is the performance?

It's faster than the original JSON based API. I implemented
Apache Arrow support for a C# client, so I measured only with
Apache Arrow C#, but the Apache Arrow based API was faster
than the JSON based API.

> Have you measured the throughput of this approach to see
> if it is comparable to using Flight SQL?

Sorry. I didn't measure the throughput. In this case, the
elapsed time of one request/response pair is more important
than throughput. It was faster than the JSON based API and
the performance was good enough.

I couldn't compare it to a Flight SQL based approach because
Groonga doesn't support Flight SQL yet.

> Is this approach able to saturate a fast network
> connection?

I think that we can't measure this with the Groonga case
because Groonga doesn't send data continuously in this case.
Here is one of the request patterns:

1. Groonga has log data partitioned by day
2. Groonga does full text search against one partition
   (2023-11-01)
3. Groonga sends the result to the client as Apache Arrow
   streaming format record batches
4. Groonga does full text search against the next partition
   (2023-11-02)
5. Groonga sends the result to the client as Apache Arrow
   streaming format record batches
6. ...

In this case, result data isn't being sent all the time.
(search -> send -> search -> send -> ...) So it doesn't
saturate a fast network connection. (3. and 4. can run in
parallel but that's not implemented yet.)

If we optimize this approach, it may be able to saturate a
fast network connection.

> And what about the case in which the server wants to begin sending batches
> to the client before the total number of result batches / records is known?

Ah, sorry. I forgot to explain this case. Groonga uses the
above approach for it.

> - server should not return the result data in the body of a response to a
>   query request; instead server should return a response body that gives
>   URI(s) at which clients can GET the result data

If we want to do this, the standard "Location" HTTP header
may be suitable.
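
As a rough illustration of the search -> send -> search ->
send pattern above, here is a minimal Python sketch. It is
not Groonga's implementation: search_partition() is a
hypothetical stand-in that returns placeholder bytes instead
of Apache Arrow record batches, and the framing assumes
HTTP/1.1 chunked transfer encoding, so the server never needs
to know the total size before it starts sending.

```python
from typing import Iterable, Iterator


def search_partition(day: str) -> bytes:
    """Hypothetical stand-in for one full text search against one partition.

    A real server would return serialized Arrow record batches here.
    """
    return f"record-batches-for-{day};".encode("ascii")


def stream_results(days: Iterable[str]) -> Iterator[bytes]:
    """Search each partition in turn and yield its result as soon as it is ready."""
    for day in days:
        yield search_partition(day)  # search -> send -> search -> send -> ...


def encode_chunked(payloads: Iterable[bytes]) -> Iterator[bytes]:
    """Frame payloads as an HTTP/1.1 chunked body: hex size, CRLF, data, CRLF."""
    for payload in payloads:
        if payload:  # a zero-length chunk would end the body early
            yield f"{len(payload):X}\r\n".encode("ascii") + payload + b"\r\n"
    yield b"0\r\n\r\n"  # last chunk; no total length was needed up front


def decode_chunked(body: bytes) -> bytes:
    """Minimal decoder, used only to check the framing above."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol], 16)
        if size == 0:
            return bytes(out)
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the data and its trailing CRLF


if __name__ == "__main__":
    days = ["2023-11-01", "2023-11-02"]
    wire = b"".join(encode_chunked(stream_results(days)))
    print(wire)
```

Because both stages are generators, each partition's result
is framed and handed to the sender before the next search
starts, which is also why the connection is idle while a
search is running.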

> - transmit result data in chunks (Transfer-Encoding: chunked), with
>   recommendations about chunk size

Ah, sorry. I forgot to explain this case too. Groonga uses
"Transfer-Encoding: chunked". But the recommended chunk size
may be case-by-case... If a server can produce data fast
enough, a larger chunk size may be faster. Otherwise, a
larger chunk size may be slower.

> - support range requests (Accept-Range: bytes) to allow clients to request
>   result ranges (or not?)

In the Groonga case, it's not supported because Groonga
drops the result after one request/response pair. Groonga
can't return only the specified range of the result after
the response has been returned.

> - recommendations about compression

If the network is the bottleneck, LZ4 or Zstandard
compression will improve total performance.

> - recommendations about TCP receive window size
> - recommendation to open multiple TCP connections on very fast networks
>   (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck

HTTP/3 may be better for these cases.


Thanks,
-- 
kou

In <CANa9GTHuXBBkn-=uevmbr2edmiyquunc6qdqdvh7gpeps9c...@mail.gmail.com>
  "Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs" on Sat, 18 Nov 2023 13:51:53 -0500,
  Ian Cook <ianmc...@apache.org> wrote:

> Hi Kou,
>
> I think it is too early to make a specific proposal. I hope to use this
> discussion to collect more information about existing approaches. If
> several viable approaches emerge from this discussion, then I think we
> should make a document listing them, like you suggest.
>
> Thank you for the information about Groonga. This type of straightforward
> HTTP-based approach would work in the context of a REST API, as I
> understand it.
>
> But how is the performance? Have you measured the throughput of this
> approach to see if it is comparable to using Flight SQL? Is this approach
> able to saturate a fast network connection?
>
> And what about the case in which the server wants to begin sending batches
> to the client before the total number of result batches / records is known?
> Would this approach work in that case? I think so but I am not sure.
>
> If this HTTP-based type of approach is sufficiently performant and it works
> in a sufficient proportion of the envisioned use cases, then perhaps the
> proposed spec / protocol could be based on this approach. If so, then we
> could refocus this discussion on which best practices to incorporate /
> recommend, such as:
> - server should not return the result data in the body of a response to a
>   query request; instead server should return a response body that gives
>   URI(s) at which clients can GET the result data
> - transmit result data in chunks (Transfer-Encoding: chunked), with
>   recommendations about chunk size
> - support range requests (Accept-Range: bytes) to allow clients to request
>   result ranges (or not?)
> - recommendations about compression
> - recommendations about TCP receive window size
> - recommendation to open multiple TCP connections on very fast networks
>   (e.g. >25 Gbps) where a CPU thread could be the throughput bottleneck
>
> On the other hand, if the performance and functionality of this HTTP-based
> type of approach is not sufficient, then we might consider fundamentally
> different approaches.
>
> Ian