Hey Yibo,

Thanks for investigating this! This is a great writeup.
There was a PR recently to let clients set gRPC options like this, so it can
be enabled on a case-by-case basis: https://github.com/apache/arrow/pull/7406
So we could add that to the benchmark or suggest it in the documentation.

I think this benchmark is a bit of a pathological case for gRPC. gRPC will
share sockets when all client options are exactly the same; it seems just
adding TLS, for instance, would break that (unless you intentionally shared
TLS credentials, which Flight doesn't): https://github.com/grpc/grpc/issues/15207
I believe grpc-java doesn't have this behavior (different Channel instances
won't share connections).

Also, did you investigate the SO_ZEROCOPY flag gRPC now offers? I wonder if
that might also help performance a bit.
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga1eb58c302eaf27a5d982b30402b8f84a

Best,
David

On 6/17/20, Chengxin Ma <c...@protonmail.ch.invalid> wrote:
> Hi Yibo,
>
> Your discovery is impressive.
>
> Did you consider the `num_streams` parameter [1] as well? If I understood
> correctly, this parameter sets the number of conceptual concurrent streams
> between the client and the server, while `num_threads` sets the size of
> the thread pool that actually handles these streams [2]. By default, both
> parameters are 4.
>
> As for CPU usage, the parameter `records_per_batch` [3] has an impact as
> well. If you increase its value, you will probably see that the data
> transfer speed increases while the server-side CPU usage drops [4].
> My guess is that as more records are put in one record batch, the total
> number of batches decreases. CPU is only used for (de)serializing the
> metadata (i.e. the schema) of each record batch, while the payload itself
> can be transferred without (de)serialization cost [5].
>
> [1] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
> [2] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
> [3] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
> [4] https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
> [5] See "Optimizing Data Throughput over gRPC" in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
>
> Kind Regards
> Chengxin
>
> Sent with ProtonMail Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, June 17, 2020 8:35 AM, Yibo Cai <yibo....@arm.com> wrote:
>
>> Found a way to achieve a reasonable benchmark result with multiple
>> threads. Diff pasted below for a quick review or try.
>> Tested on an E5-2650, with this change:
>> num_threads = 1, speed = 1996
>> num_threads = 2, speed = 3555
>> num_threads = 4, speed = 5828
>>
>> When running `arrow_flight_benchmark`, I find there's only one TCP
>> connection between the client and the server, no matter what
>> `num_threads` is. All clients share one TCP connection. At the server
>> side, I see only one thread processing network packets. On my machine,
>> one client already saturates a CPU core, so things get worse as
>> `num_threads` increases, because that single server thread becomes the
>> bottleneck.
>>
>> If running in standalone mode, flight clients come from different
>> processes and have their own TCP connections to the server. There are
>> separate server threads handling network traffic for each connection,
>> without a central bottleneck.
>>
>> I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just
>> before giving up. Setting that arg makes each client establish its own
>> TCP connection to the server, similar to standalone mode.
>>
>> Actually, I'm not quite sure if we should set this arg. Sharing one TCP
>> connection is a reasonable configuration, and it's an advantage of
>> gRPC [2].
>>
>> Per my test, most CPU cycles are spent in kernel mode doing networking
>> and data transfer. Maybe a better solution is to leverage modern network
>> techniques like RDMA or a user-mode stack for higher performance.
>>
>> [1] https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
>> [2] https://platformlab.stanford.edu/Seminar Talks/gRPC.pdf, page 5
>>
>> diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
>> index d530093d9..6904640d3 100644
>> --- a/cpp/src/arrow/flight/client.cc
>> +++ b/cpp/src/arrow/flight/client.cc
>> @@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
>>      args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
>>      // Receive messages of any size
>>      args.SetMaxReceiveMessageSize(-1);
>> +    // Setting this arg enables each client to open its own TCP connection to server,
>> +    // not sharing one single connection, which becomes a bottleneck under high load.
>> +    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
>>
>>      if (options.override_hostname != "") {
>>        args.SetSslTargetNameOverride(options.override_hostname);
>>
>> On 6/15/20 10:00 PM, Wes McKinney wrote:
>>
>> > On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou anto...@python.org
>> > wrote:
>> >
>> > > On 15/06/2020 at 15:36, Wes McKinney wrote:
>> > >
>> > > > When you have only a single server, all the gRPC traffic goes
>> > > > through a common port and is handled by a common server, so if both
>> > > > the client and the server are roughly IO-bound you aren't going to
>> > > > get better performance by hitting the server with multiple clients
>> > > > simultaneously, only worse, because the packets from different
>> > > > client requests are intermingled in the TCP traffic on that port.
>> > > > I'm not a networking expert, but this is my best understanding of
>> > > > what is going on.
>> > >
>> > > Yibo Cai's experiment disproves that explanation, though.
>> > > When I run a single client against the test server, I get ~4 GB/s.
>> > > When I run 6 standalone clients against the same test server, I get
>> > > ~8 GB/s aggregate. So there's something else going on that limits
>> > > scalability when the benchmark executable runs all clients by itself
>> > > (perhaps gRPC clients in a single process share some underlying
>> > > structure or execution threads? I don't know).
>> >
>> > I see, thanks. OK, then clearly something else is going on.
>> >
>> > > > I hope someone will implement the "multiple test servers" TODO in
>> > > > the benchmark.
>> > >
>> > > I think that's a bad idea in any case, as running multiple servers on
>> > > different ports is not a realistic expectation from users.
>> > >
>> > > Regards
>> > >
>> > > Antoine.
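
For anyone who wants to try both knobs discussed above without patching
client.cc, here is a rough, untested sketch at the raw gRPC level.
GRPC_ARG_TCP_TX_ZEROCOPY_ENABLED is my reading of the SO_ZEROCOPY option
David links (Linux-only, and gRPC only uses zero-copy for sufficiently large
sends); the helper name and target address are made up for illustration.

#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

// Builds a channel that (a) gets its own subchannel pool, i.e. its own TCP
// connection instead of sharing a socket with every identically configured
// channel in the process, and (b) opts into gRPC's SO_ZEROCOPY send path.
std::shared_ptr<grpc::Channel> MakeBenchmarkChannel(const std::string& target) {
  grpc::ChannelArguments args;
  // Same arg Yibo's diff sets inside Flight's client implementation.
  args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
  // Assumed name of the zero-copy arg behind the doc link above (untested).
  args.SetInt(GRPC_ARG_TCP_TX_ZEROCOPY_ENABLED, 1);
  return grpc::CreateCustomChannel(target, grpc::InsecureChannelCredentials(),
                                   args);
}

As David notes, gRPC shares sockets only when the channel args match exactly,
so any per-client difference in the args should also keep clients on separate
connections.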
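
At the Flight level, once the PR David mentions is in, the benchmark could
request a private connection per client along these lines. This is only a
sketch: generic_options is my guess at the name of the per-client option list
that PR exposes, and the exact member and value type may differ from what was
merged; the function name is made up.

#include <memory>
#include <string>
#include <utility>

#include "arrow/flight/api.h"
#include "arrow/status.h"

// Connects one benchmark client that asks gRPC for its own TCP connection,
// instead of hard-coding GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL in client.cc.
arrow::Status ConnectWithOwnConnection(
    const std::string& host, int port,
    std::unique_ptr<arrow::flight::FlightClient>* client) {
  arrow::flight::Location location;
  ARROW_RETURN_NOT_OK(arrow::flight::Location::ForGrpcTcp(host, port, &location));

  arrow::flight::FlightClientOptions options;
  // "grpc.use_local_subchannel_pool" is the string key behind
  // GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL; generic_options is assumed to be a
  // list of (name, value) pairs forwarded into grpc::ChannelArguments.
  options.generic_options.emplace_back("grpc.use_local_subchannel_pool", 1);

  return arrow::flight::FlightClient::Connect(location, options, client);
}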