Hello,

I've been trying to saturate several CPU cores using our Flight
benchmark (which spawns a server process and attempts to communicate
with it using multiple clients), but haven't managed to.

The typical command-line I'm executing is the following:

$ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark \
    -records_per_stream 50000000 -num_streams 16 -num_threads 32 \
    -records_per_batch 120000

Breakdown:

- "time": I want to get CPU user / system / wall-clock times

- "taskset -c ...": I have an 8-core, 16-thread machine and I want to
  allow scheduling RPC threads on 4 distinct physical cores

- "-records_per_stream": I want each stream to have enough records so
  that connection / stream setup costs are negligible

- "-num_streams": this is the number of streams the benchmark tries to
  download (DoGet()) from the server to the client

- "-num_threads": this is the number of client threads from which the
  benchmark makes download requests.  Since our client is currently
  blocking, it makes sense to have a large number of client threads (to
  allow overlap).  Note that each thread creates a separate gRPC client
  and connection.

- "-records_per_batch": transfer enough records per individual RPC
  message to minimize per-message overhead.  This number brings each
  batch close to the default gRPC message size limit of 4 MB.

The results I get look like:

Bytes read: 25600000000
Nanos: 8433804781
Speed: 2894.79 MB/s

real    0m8,569s
user    0m6,085s
sys     0m15,667s
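
As a quick sanity check on these figures, here is some
back-of-the-envelope arithmetic (in Python, using only the numbers
above):

    total_records = 50_000_000 * 16              # records_per_stream * num_streams
    print(25_600_000_000 / total_records)        # 32 bytes per record
    print(120_000 * 32 / 1e6)                    # ~3.84 MB per batch, just under the 4 MB limit
    print(25_600_000_000 / 8.433804781 / 2**20)  # ~2894.8, so the reported "MB/s" is MiB/s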


If we divide (user + sys) by real, we conclude that roughly 2.5 cores
are saturated by this benchmark.  Evidently, this means that the
benchmark is waiting a *lot*.  The question is: where?

Here are some things I looked at:

- mutex usage inside Arrow.  No contended mutex seems to pop up (printf
  is my friend).

- number of threads used by the gRPC server.  gRPC implicitly spawns a
  number of threads to handle incoming client requests.  I've checked
  (using printf...) that several threads are indeed used to serve
  incoming connections.

- CPU usage bottlenecks.  80% of the entire benchmark's CPU time is
  spent in memcpy() calls in the *client* (more precisely, in the
  grpc_byte_buffer_reader_readall() call inside
  arrow::flight::internal::FlightDataDeserialize()).  It doesn't look
  like the server is the bottleneck.

- the benchmark connects to "localhost".  I've changed it to
  "127.0.0.1", it doesn't make a difference.  AFAIK, localhost TCP
  connections should be well-optimized on Linux.  It seems highly
  unlikely that they would incur idle waiting times (rather than CPU
  time processing packets).

- RAM usage.  It's quite reasonable at 220 MB (client) + 75 MB
  (server).  No swapping occurs.

- Disk I/O.  "vmstat" tells me no block I/O happens during the
  benchmark.

- As a reference, I can transfer 5 GB/s over a single TCP connection
  using plain sockets in a simple Python script.  3 GB/s over multiple
  connections doesn't look terrific in comparison.
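
For concreteness, that plain-socket test was roughly of the following
shape (a minimal sketch, not the exact script I used; single process
with two threads for brevity, arbitrary port and 1 MiB chunks):

    import socket
    import threading
    import time

    CHUNK = b"x" * (1 << 20)   # 1 MiB per send
    TOTAL = 5 * (1 << 30)      # ~5 GiB overall
    PORT = 31337               # arbitrary local port

    def sender(port):
        # Push TOTAL bytes over a single TCP connection as fast as possible
        with socket.socket() as s:
            s.connect(("127.0.0.1", port))
            sent = 0
            while sent < TOTAL:
                s.sendall(CHUNK)
                sent += len(CHUNK)

    def main():
        with socket.socket() as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("127.0.0.1", PORT))
            srv.listen(1)
            threading.Thread(target=sender, args=(PORT,), daemon=True).start()
            conn, _ = srv.accept()
            received = 0
            start = time.perf_counter()
            with conn:
                while received < TOTAL:
                    data = conn.recv(1 << 20)
                    if not data:
                        break
                    received += len(data)
            elapsed = time.perf_counter() - start
            print("%.2f MB/s" % (received / elapsed / 1e6))

    if __name__ == "__main__":
        main()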


So it looks like there's a scalability issue inside our current Flight
code, or perhaps inside gRPC.  The benchmark itself, while simplistic,
doesn't look problematic; it should actually be something of a best
case, especially with the above parameters.

Does anyone have any clues or ideas?  In particular, is there a simple
way to diagnose *where* exactly the waiting times happen?
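
One avenue I'm considering is to complement the on-CPU profile with an
off-CPU (blocked-time) profile of the client, e.g. with Linux perf and
BCC's offcputime; the tool names and flags below are from memory and
may need adjusting:

    $ perf record -g -p <client pid> -- sleep 5   # on-CPU stacks
    $ perf report
    $ sudo offcputime-bpfcc -p <client pid> 5     # off-CPU (blocked) stacks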

Regards

Antoine.
