A quick glance suggests you're limited by the kernel copying memory around (https://gist.github.com/fsaintjacques/1fa00c8e50a09325960d8dc7463c497e). I think the next step is to use different physical hosts for client and server. This way you'll free resources for the server.
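In case it helps to reproduce that kind of measurement locally, here is a minimal sketch of capturing an off-CPU flame graph with the BCC tools plus Brendan Gregg's FlameGraph scripts. It is untested here and assumes the BCC tools live under /usr/share/bcc/tools, that flamegraph.pl is on your PATH, and that the pgrep pattern really matches (only) the client process:

  # record folded off-CPU stacks (kernel + user, '-d' delimiter) for 30 s
  # of the running benchmark client; offcputime reports microseconds
  $ sudo /usr/share/bcc/tools/offcputime -df -p $(pgrep -f arrow-flight-benchmark) 30 > offcpu.folded

  # turn the folded stacks into an off-CPU flame graph
  $ flamegraph.pl --color=io --countname=us --title="Off-CPU time" < offcpu.folded > offcpu.svg

The user-space halves of those stacks will suffer from the same frame-pointer caveat discussed in the quoted thread below, but the kernel side should already show whether the blocked time goes to futex waits, socket reads, or memory copies.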
François

On Thu, Feb 21, 2019 at 12:42 PM Antoine Pitrou <anto...@python.org> wrote:
>
> We're talking about the BCC tools, which are not based on perf:
> https://github.com/iovisor/bcc/
>
> Apparently, using Linux perf for the same purpose is some kind of hassle
> (you need to write perf scripts?).
>
> Regards
>
> Antoine.
>
> On 21/02/2019 18:40, Francois Saint-Jacques wrote:
> > You can compile with dwarf (-g/-ggdb) and use `--call-graph=dwarf` to perf,
> > it'll help the unwinding. Sometimes it's better than the stack pointer
> > method since it keeps track of inlined functions.
> >
> > On Thu, Feb 21, 2019 at 12:39 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >> Ah, thanks. I'm trying it now. The problem is that it doesn't record
> >> userspace stack traces properly (it probably needs all dependencies to
> >> be recompiled with -fno-omit-frame-pointer :-/). So while I know that a
> >> lot of time is spent waiting for futexes, I don't know if that is for a
> >> legitimate reason...
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >> On 21/02/2019 17:52, Hatem Helal wrote:
> >>> I was thinking of this variant:
> >>>
> >>> http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> >>>
> >>> but I must admit that I haven't tried that technique myself.
> >>>
> >>> On 2/21/19, 4:41 PM, "Antoine Pitrou" <solip...@pitrou.net> wrote:
> >>>
> >>> I don't think that's the answer here. The question is not how
> >>> to /visualize/ where time is spent waiting, but how to /measure/ it.
> >>> Normal profiling only tells you where CPU time is spent, not what the
> >>> process is idly waiting for.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>> On Thu, 21 Feb 2019 16:29:15 +0000
> >>> Hatem Helal <hhe...@mathworks.com> wrote:
> >>> > I like flamegraphs for investigating this sort of problem:
> >>> >
> >>> > https://github.com/brendangregg/FlameGraph
> >>> >
> >>> > There are likely many other techniques for inspecting where time is
> >>> > being spent but that can at least help narrow down the search space.
> >>> >
> >>> > On 2/21/19, 4:03 PM, "Francois Saint-Jacques" <fsaintjacq...@gmail.com> wrote:
> >>> >
> >>> > Can you remind us what's the easiest way to get flight working with grpc?
> >>> > clone + make install doesn't really work out of the box.
> >>> >
> >>> > François
> >>> >
> >>> > On Thu, Feb 21, 2019 at 10:41 AM Antoine Pitrou <anto...@python.org> wrote:
> >>> >
> >>> > > Hello,
> >>> > >
> >>> > > I've been trying to saturate several CPU cores using our Flight
> >>> > > benchmark (which spawns a server process and attempts to communicate
> >>> > > with it using multiple clients), but haven't managed to.
> >>> > >
> >>> > > The typical command-line I'm executing is the following:
> >>> > >
> >>> > > $ time taskset -c 1,3,5,7 ./build/release/arrow-flight-benchmark
> >>> > >     -records_per_stream 50000000 -num_streams 16 -num_threads 32
> >>> > >     -records_per_batch 120000
> >>> > >
> >>> > > Breakdown:
> >>> > >
> >>> > > - "time": I want to get CPU user / system / wall-clock times
> >>> > >
> >>> > > - "taskset -c ...": I have an 8-core, 16-thread machine and I want
> >>> > >   to allow scheduling RPC threads on 4 distinct physical cores
> >>> > >
> >>> > > - "-records_per_stream": I want each stream to have enough records
> >>> > >   so that connection / stream setup costs are negligible
> >>> > >
> >>> > > - "-num_streams": this is the number of streams the benchmark tries
> >>> > >   to download (DoGet()) from the server to the client
> >>> > >
> >>> > > - "-num_threads": this is the number of client threads from which
> >>> > >   the benchmark makes download requests. Since our client is
> >>> > >   currently blocking, it makes sense to have a large number of
> >>> > >   client threads (to allow overlap). Note that each thread creates
> >>> > >   a separate gRPC client and connection.
> >>> > >
> >>> > > - "-records_per_batch": transfer enough records per individual RPC
> >>> > >   message to minimize overhead. This number brings us close to the
> >>> > >   default gRPC message limit of 4 MB.
> >>> > >
> >>> > > The results I get look like:
> >>> > >
> >>> > > Bytes read: 25600000000
> >>> > > Nanos: 8433804781
> >>> > > Speed: 2894.79 MB/s
> >>> > >
> >>> > > real    0m8,569s
> >>> > > user    0m6,085s
> >>> > > sys     0m15,667s
> >>> > >
> >>> > > If we divide (user + sys) by real, we conclude that 2.5 cores are
> >>> > > saturated by this benchmark. Evidently, this means that the
> >>> > > benchmark is waiting a *lot*. The question is: where?
> >>> > >
> >>> > > Here are some things I looked at:
> >>> > >
> >>> > > - mutex usage inside Arrow. None seems to pop up (printf is my
> >>> > >   friend).
> >>> > >
> >>> > > - number of threads used by the gRPC server. gRPC implicitly spawns
> >>> > >   a number of threads to handle incoming client requests. I've
> >>> > >   checked (using printf...) that several threads are indeed used to
> >>> > >   serve incoming connections.
> >>> > >
> >>> > > - CPU usage bottlenecks. 80% of the entire benchmark's CPU time is
> >>> > >   spent in memcpy() calls in the *client* (precisely, in the
> >>> > >   grpc_byte_buffer_reader_readall() call inside
> >>> > >   arrow::flight::internal::FlightDataDeserialize()). It doesn't
> >>> > >   look like the server is the bottleneck.
> >>> > >
> >>> > > - the benchmark connects to "localhost". I've changed it to
> >>> > >   "127.0.0.1", it doesn't make a difference. AFAIK, localhost TCP
> >>> > >   connections should be well-optimized on Linux. It seems highly
> >>> > >   unlikely that they would incur idle waiting times (rather than
> >>> > >   CPU time processing packets).
> >>> > >
> >>> > > - RAM usage. It's quite reasonable at 220 MB (client) + 75 MB
> >>> > >   (server). No swapping occurs.
> >>> > >
> >>> > > - Disk I/O. "vmstat" tells me no block I/O happens during the
> >>> > >   benchmark.
> >>> > >
> >>> > > - As a reference, I can transfer 5 GB/s over a single TCP
> >>> > >   connection using plain sockets in a simple Python script.
> >>> > >   3 GB/s over multiple connections doesn't look terrific.
> >>> > >
> >>> > > So it looks like there's a scalability issue inside our current
> >>> > > Flight code, or perhaps inside gRPC. The benchmark itself, while
> >>> > > simplistic, doesn't look problematic; it should actually be kind of
> >>> > > a best case, especially with the above parameters.
> >>> > >
> >>> > > Does anyone have any clues or ideas? In particular, is there a
> >>> > > simple way to diagnose *where* exactly the waiting times happen?
> >>> > >
> >>> > > Regards
> >>> > >
> >>> > > Antoine.
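P.S. For completeness, the `--call-graph=dwarf` suggestion quoted above would look roughly like the following in practice. This is only a sketch: the RelWithDebInfo build type, the -DARROW_FLIGHT=ON flag, the build-directory layout, and the 99 Hz / 10 s sampling window are my assumptions, not something the benchmark requires.

  # rebuild with optimizations plus DWARF debug info so perf can unwind
  # user-space stacks without frame pointers (run from a cpp build dir)
  $ cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DARROW_FLIGHT=ON .. && make -j8

  # sample the running client at 99 Hz for ~10 s, unwinding via DWARF
  $ perf record -F 99 --call-graph dwarf -p $(pgrep -f arrow-flight-benchmark) -- sleep 10
  $ perf report --stdio --no-children

If a visual is easier to read, the same perf.data can be fed through perf script and the FlameGraph repo's stackcollapse-perf.pl into flamegraph.pl.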