Hello,

I've been trying to saturate several CPU cores using our Flight benchmark (which spawns a server process and attempts to communicate with it using multiple clients), but haven't managed to.

The typical command line I'm executing is the following:

$ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark -records_per_stream 50000000 -num_streams 16 -num_threads 32 -records_per_batch 120000

Breakdown:

- "time": I want to get CPU user / system / wall-clock times.

- "taskset -c ...": I have an 8-core, 16-thread machine and I want to allow scheduling RPC threads on 4 distinct physical cores.

- "-records_per_stream": I want each stream to have enough records so that connection / stream setup costs are negligible.

- "-num_streams": the number of streams the benchmark tries to download (DoGet()) from the server to the client.

- "-num_threads": the number of client threads from which the benchmark makes download requests. Since our client is currently blocking, it makes sense to have a large number of client threads (to allow overlap). Note that each thread creates a separate gRPC client and connection.

- "-records_per_batch": transfer enough records per individual RPC message to minimize overhead. This number brings us close to the default gRPC message limit of 4 MB.

The results I get look like:

Bytes read: 25600000000
Nanos: 8433804781
Speed: 2894.79 MB/s

real    0m8,569s
user    0m6,085s
sys     0m15,667s

If we divide (user + sys) by real, i.e. (6.085 + 15.667) / 8.569, we conclude that about 2.5 cores are saturated by this benchmark. Evidently, this means that the benchmark is waiting a *lot*. The question is: where?

Here are some things I looked at:

- Mutex usage inside Arrow. None seems to pop up (printf is my friend).

- The number of threads used by the gRPC server. gRPC implicitly spawns a number of threads to handle incoming client requests. I've checked (using printf...) that several threads are indeed used to serve incoming connections.

- CPU usage bottlenecks. 80% of the entire benchmark's CPU time is spent in memcpy() calls in the *client* (precisely, in the grpc_byte_buffer_reader_readall() call inside arrow::flight::internal::FlightDataDeserialize()). It doesn't look like the server is the bottleneck.

- The benchmark connects to "localhost". I've changed it to "127.0.0.1"; it doesn't make a difference. AFAIK, localhost TCP connections should be well optimized on Linux. It seems highly unlikely that they would incur idle waiting times (rather than CPU time processing packets).

- RAM usage. It's quite reasonable at 220 MB (client) + 75 MB (server). No swapping occurs.

- Disk I/O. "vmstat" tells me no block I/O happens during the benchmark.

- As a reference, I can transfer 5 GB/s over a single TCP connection using plain sockets in a simple Python script (a sketch of what I mean is attached below). 3 GB/s over multiple connections doesn't look terrific.

So it looks like there's a scalability issue inside our current Flight code, or perhaps inside gRPC. The benchmark itself, while simplistic, doesn't look problematic; it should actually be something of a best case, especially with the above parameters.

Does anyone have any clues or ideas? In particular, is there a simple way to diagnose *where* exactly the waiting times happen?

Regards

Antoine.
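
P.S. For reference, here is a minimal sketch of the kind of plain-socket comparison I mean. It is not the exact script I used; the total size, buffer size, and port handling are illustrative, but it measures the same thing: raw localhost TCP throughput into a single client.

# Minimal plain-socket throughput sketch (illustrative sizes, not the exact script used).
import socket
import threading
import time

TOTAL_BYTES = 256_000_000        # scaled-down total for a quick run
CHUNK = 4 * 1024 * 1024          # 4 MB sends, similar to the gRPC message size limit

def server(listener):
    # Accept one client and push TOTAL_BYTES of dummy data at it.
    conn, _ = listener.accept()
    payload = b"x" * CHUNK
    sent = 0
    while sent < TOTAL_BYTES:
        conn.sendall(payload)
        sent += len(payload)
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # pick a free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=server, args=(listener,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
buf = bytearray(CHUNK)
received = 0
start = time.monotonic()
while received < TOTAL_BYTES:
    n = client.recv_into(buf)
    if n == 0:
        break
    received += n
elapsed = time.monotonic() - start
client.close()
print(f"Received {received} bytes in {elapsed:.2f} s -> {received / elapsed / 1e6:.1f} MB/s")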
I've been trying to saturate several CPU cores using our Flight benchmark (which spawns a server process and attempts to communicate with it using multiple clients), but haven't managed to. The typical command-line I'm executing is the following: $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark -records_per_stream 50000000 -num_streams 16 -num_threads 32 -records_per_batch 120000 Breakdown: - "time": I want to get CPU user / system / wall-clock times - "taskset -c ...": I have a 8-core 16-threads machine and I want to allow scheduling RPC threads on 4 distinct physical cores - "-records_per_stream": I want each stream to have enough records so that connection / stream setup costs are negligible - "-num_streams": this is the number of streams the benchmark tries to download (DoGet()) from the server to the client - "-num_threads": this is the number of client threads the benchmark makes download requests from. Since our client is currently blocking, it makes sense to have a large number of client threads (to allow overlap). Note that each thread creates a separate gRPC client and connection. - "-records_per_batch": transfer enough records per individual RPC message, to minimize overhead. This number brings us close to the default gRPC message limit of 4 MB. The results I get look like: Bytes read: 25600000000 Nanos: 8433804781 Speed: 2894.79 MB/s real 0m8,569s user 0m6,085s sys 0m15,667s If we divide (user + sys) by real, we conclude that 2.5 cores are saturated by this benchmark. Evidently, this means that the benchmark is waiting a *lot*. The question is: where? Here is some things I looked at: - mutex usage inside Arrow. None seems to pop up (printf is my friend). - number of threads used by the gRPC server. gRPC implicitly spawns a number of threads to handle incoming client requests. I've checked (using printf...) that several threads are indeed used to serve incoming connections. - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is spent in memcpy() calls in the *client* (precisely, in the grpc_byte_buffer_reader_readall() call inside arrow::flight::internal::FlightDataDeserialize()). It doesn't look like the server is the bottleneck. - the benchmark connects to "localhost". I've changed it to "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP connections should be well-optimized on Linux. It seems highly unlikely that they would incur idle waiting times (rather than CPU time processing packets). - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB (server). No swapping occurs. - Disk I/O. "vmstat" tells me no block I/O happens during the benchmark. - As a reference, I can transfer 5 GB/s over a single TCP connection using plain sockets in a simple Python script. 3 GB/s over multiple connections doesn't look terrific. So it looks like there's a scalability issue inside our current Flight code, or perhaps inside gRPC. The benchmark itself, if simplistic, doesn't look problematic; it should actually be kind of a best case, especially with the above parameters. Does anyone have any clues or ideas? In particular, is there a simple way to diagnose *where* exactly the waiting times happen? Regards Antoine.