Hey Yibo,

Thanks for investigating this! This is a great writeup.

There was a PR recently to let clients set gRPC options like this, so
it can be enabled on a case-by-case basis:
https://github.com/apache/arrow/pull/7406
So we could add that to the benchmark or suggest it in the documentation.
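For example, the pass-through might look something like this (a rough sketch
only; the exact option-passing API from that PR, including the
`generic_options` field name used here, is an assumption on my part):

    // Hypothetical sketch of passing a raw gRPC channel arg through Flight.
    #include <grpc/grpc.h>            // for GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL
    #include "arrow/flight/client.h"

    arrow::Status ConnectWithOwnConnection(
        const arrow::flight::Location& location,
        std::unique_ptr<arrow::flight::FlightClient>* client) {
      arrow::flight::FlightClientOptions options;
      // Assumed field: a list of (key, value) gRPC options forwarded to the channel.
      options.generic_options.emplace_back(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
      return arrow::flight::FlightClient::Connect(location, options, client);
    }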

I think this benchmark is a bit of a pathological case for gRPC. gRPC
will share sockets when all client options are exactly the same; it
seems just adding TLS, for instance, would break that (unless you
intentionally shared TLS credentials, which Flight doesn't):
https://github.com/grpc/grpc/issues/15207. I believe grpc-java doesn't
have this behavior (different Channel instances won't share
connections).
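To illustrate at the raw gRPC C++ layer (a sketch independent of Flight; the
target address is just a placeholder):

    #include <grpcpp/grpcpp.h>

    void Demo() {
      // Channels with identical target, credentials, and arguments normally
      // reuse one subchannel, i.e. one TCP connection, within a process.
      grpc::ChannelArguments shared_args;
      auto a = grpc::CreateCustomChannel(
          "localhost:31337", grpc::InsecureChannelCredentials(), shared_args);
      auto b = grpc::CreateCustomChannel(
          "localhost:31337", grpc::InsecureChannelCredentials(), shared_args);

      // Differing arguments (or opting into a local subchannel pool, as in the
      // diff below) break the sharing and give the channel its own connection.
      grpc::ChannelArguments own_args;
      own_args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
      auto c = grpc::CreateCustomChannel(
          "localhost:31337", grpc::InsecureChannelCredentials(), own_args);
    }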

Also, did you investigate the SO_ZEROCOPY flag gRPC now offers? I
wonder if that might also help performance a bit.
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga1eb58c302eaf27a5d982b30402b8f84a
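If someone wants to try it, I believe it boils down to another channel arg
(a sketch; I'm assuming the key is GRPC_ARG_TCP_TX_ZEROCOPY_ENABLED per the
linked page, and that the kernel supports MSG_ZEROCOPY):

    #include <grpcpp/grpcpp.h>

    // Sketch: request gRPC's TCP transmit zerocopy on a channel.
    void EnableTxZeroCopy(grpc::ChannelArguments* args) {
      args->SetInt(GRPC_ARG_TCP_TX_ZEROCOPY_ENABLED, 1);
    }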

Best,
David

On 6/17/20, Chengxin Ma <c...@protonmail.ch.invalid> wrote:
> Hi Yibo,
>
>
> Your discovery is impressive.
>
>
> Did you consider the `num_streams` parameter [1] as well? If I understood
> correctly, this parameter sets the number of concurrent streams between the
> client and the server, while `num_threads` sets the size of the thread pool
> that actually handles these streams [2]. Both parameters default to 4.
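>
> A rough sketch of how those two knobs are declared in flight_benchmark.cc
> (paraphrased from [1] and [2]; the real help strings differ):
>
>     #include <gflags/gflags.h>
>
>     DEFINE_int32(num_streams, 4, "Concurrent streams requested per client");
>     DEFINE_int32(num_threads, 4, "Client threads consuming those streams");
>
> Both can be overridden on the benchmark command line, alongside the
> `records_per_batch` flag discussed below.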
>
>
> As for CPU usage, the `records_per_batch` parameter [3] has an impact as
> well. If you increase its value, you will probably see the data transfer
> speed increase while the server-side CPU usage drops [4].
> My guess is that as more records are packed into one record batch, the total
> number of batches decreases. CPU is only needed for (de)serializing the
> metadata (i.e. the schema) of each record batch, while the payload itself can
> be transferred with essentially zero cost [5].
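>
> As an illustrative example: moving the same 10,000,000 records with
> records_per_batch=4096 means roughly 2,442 record batches, whereas
> records_per_batch=16384 means only about 611, i.e. roughly a quarter of the
> per-batch metadata left to (de)serialize.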
>
>
> [1] 
> https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
> [2] 
> https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
> [3] 
> https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
> [4] 
> https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
> [5] See "Optimizing Data Throughput over gRPC"
> in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
>
>
> Kind Regards
> Chengxin
>
>
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, June 17, 2020 8:35 AM, Yibo Cai <yibo....@arm.com> wrote:
>
>> I found a way to achieve reasonable benchmark results with multiple threads.
>> The diff is pasted below for a quick review or trial.
>> Tested on E5-2650, with this change:
>> num_threads = 1, speed = 1996
>> num_threads = 2, speed = 3555
>> num_threads = 4, speed = 5828
>>
>> When running `arrow_flight_benchmark`, I find there's only one TCP connection
>> between the client and the server, no matter what `num_threads` is. All
>> clients share one TCP connection. On the server side, I see only one thread
>> processing network packets. On my machine, one client already saturates a CPU
>> core, so things get worse as `num_threads` increases, because that single
>> server thread becomes the bottleneck.
>>
>> When running in standalone mode, the Flight clients live in different
>> processes and have their own TCP connections to the server. Separate server
>> threads handle the network traffic of each connection, so there is no central
>> bottleneck.
>>
>> I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just
>> before giving up. Setting that arg makes each client establish its own TCP
>> connection to the server, similar to standalone mode.
>>
>> Actually, I'm not quite sure whether we should set this arg. Sharing one TCP
>> connection is a reasonable configuration, and it's an advantage of gRPC [2].
>>
>> Per my test, most CPU cycles are spent in kernel mode doing networking and
>> data transfer. Maybe a better solution is to leverage modern networking
>> techniques like RDMA or a user-mode stack for higher performance.
>>
>> [1]
>> https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
>> [2] https://platformlab.stanford.edu/Seminar Talks/gRPC.pdf, page 5
>>
>> diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
>> index d530093d9..6904640d3 100644
>> --- a/cpp/src/arrow/flight/client.cc
>> +++ b/cpp/src/arrow/flight/client.cc
>> @@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
>>      args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
>>      // Receive messages of any size
>>      args.SetMaxReceiveMessageSize(-1);
>> +    // Setting this arg enables each client to open its own TCP connection to the server,
>> +    // instead of sharing one single connection, which becomes a bottleneck under high load.
>> +    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
>>
>>      if (options.override_hostname != "") {
>>        args.SetSslTargetNameOverride(options.override_hostname);
>>
>> On 6/15/20 10:00 PM, Wes McKinney wrote:
>>
>> > On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou anto...@python.org wrote:
>> >
>> > > On 15/06/2020 at 15:36, Wes McKinney wrote:
>> > >
>> > > > When you have only a single server, all the gRPC traffic goes through a
>> > > > common port and is handled by a common server, so if both client and
>> > > > server are roughly IO bound you aren't going to get better performance
>> > > > by hitting the server with multiple clients simultaneously, only worse,
>> > > > because the packets from different client requests are intermingled in
>> > > > the TCP traffic on that port. I'm not a networking expert but this is my
>> > > > best understanding of what is going on.
>> > >
>> > > Yibo Cai's experiment disproves that explanation, though.
>> > > When I run a single client against the test server, I get ~4 GB/s. When I
>> > > run 6 standalone clients against the same test server, I get ~8 GB/s
>> > > aggregate. So there's something else going on that limits scalability when
>> > > the benchmark executable runs all clients by itself (perhaps gRPC clients
>> > > in a single process share some underlying structure or execution threads?
>> > > I don't know).
>> >
>> > I see, thanks. OK then clearly something else is going on.
>> >
>> > > > I hope someone will implement the "multiple test servers" TODO in the
>> > > > benchmark.
>> > >
>> > > I think that's a bad idea in any case, as running multiple servers on
>> > > different ports is not a realistic expectation from users.
>> > >
>> > > Regards
>> > > Antoine.
>
>
>
