Our own benchmarks <https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5636470266134528> show about 1000x better latency than that, so something is definitely up. Can you describe how you arrived at that line number, or the tools you used to profile? (We use perf and pprof.)
On Monday, September 3, 2018 at 10:39:45 AM UTC-7, [email protected] wrote:
> Hail gRPC experts (;D),
>
> I'm trying to build an image/video object detection server (as one of the
> reusable pieces in a benchmark suite) with low RTT requirements
> (near-real-time, say ~60-90ms RTT).
> I've used gRPC and protobuf (built from git master; hashes below in case
> that is relevant) for the serialization and transport.
> _________________________________
> grpc:
> commit dbc1e27e2e1a81b61eb064eb036ec6a267f88cb6
> Merge: 9bc6cd1 5d24ab9
> Author: Jiangtao Li <email redacted by me>
> Date: Fri Jul 20 17:00:18 2018 -0700
>
> protobuf:
> commit b5fbb742af122b565925987e65c08957739976a7
> Author: Bo Yang <email redacted by me>
> Date: Mon Mar 5 19:54:18 2018 -0800
> _________________________________
>
> gRPC seems to add an insane amount of overhead -- ~160ms (~2x the server's
> processing time)!
> For now I'm running on a single machine (a pretty beefy machine, so
> contention isn't an issue) operating over localhost (loopback).
> The amount of data being transferred is considerable, but not unheard of
> (~4MiB per request).
>
> Server-side timing measurements:
> doDetection: new requeust 0x7ffc77f16920
> 0x7ffc77f16920: GPU processing took 24.045 milliseconds
> 0x7ffc77f16920: Server took *72.206 milliseconds*
>
> Client-side measurements:
> 10 objects detected.
> This request took *234.825 milliseconds*
>
> *Client RTT - Server processing time = 234.825 - 72.206 = 162.619 ms (!??!)*
> I've pinned the server and client to separate cores using taskset.
> There isn't anything else running on the server, and it's a beefy 48-core
> (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz) machine with ample RAM
> (128GiB), etc.
> As a start, I instrumented the implementation of the synchronous call
> in include/grpcpp/impl/codegen/client_unary_call.h:
>
> BlockingUnaryCallImpl(ChannelInterface* channel, const RpcMethod& method,
>                       ClientContext* context, const InputMessage& request,
>                       OutputMessage* result)
>
> and found that the vast majority of the time is spent spinning on a
> completion queue:
>
> line 107: if (cq.Pluck(&ops)) {
>
> I wonder if I need to configure gRPC differently (perhaps the default
> configuration is more geared towards latency-insensitive batching?).
>
> Any help understanding these numbers would be appreciated.
>
> Server code:
> https://github.com/aakshintala/darknet/blob/master/server/server.cpp
> Client code:
> https://github.com/aakshintala/darknet/blob/master/server/client.cpp
> Proto file:
> https://github.com/aakshintala/darknet/blob/master/server/darknetserver.proto
>
> Thanks in advance,
> Amogh Akshintala
> aakshintala.com

To view this discussion on the web visit https://groups.google.com/d/msgid/grpc-io/c37ac6ed-9149-43dc-b9a3-5574e4eca439%40googlegroups.com.
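On the configuration question: a few channel arguments are worth experimenting with for large, latency-sensitive unary calls. A hedged sketch (not from the original thread; the argument names below are the gRPC C++ API as I understand it, but please verify them against your checkout, since you're building from master):

```cpp
#include <grpcpp/grpcpp.h>

// Sketch: a channel tuned for large payloads and latency.
// MakeLowLatencyChannel is a hypothetical helper name.
std::shared_ptr<grpc::Channel> MakeLowLatencyChannel(const std::string& target) {
  grpc::ChannelArguments args;
  // The default message size cap is 4 MiB, so ~4 MiB requests sit right at
  // the edge; raising it rules that out as a factor.
  args.SetMaxReceiveMessageSize(16 * 1024 * 1024);
  args.SetMaxSendMessageSize(16 * 1024 * 1024);
  // Hint the transport to tune its buffers for latency rather than
  // throughput (GRPC_ARG_OPTIMIZATION_TARGET is "grpc.optimization_target").
  args.SetString(GRPC_ARG_OPTIMIZATION_TARGET, "latency");
  return grpc::CreateCustomChannel(
      target, grpc::InsecureChannelCredentials(), args);
}
```

That said, profiling first (as suggested above) is the right order of operations: time spent blocked in `cq.Pluck` is just the client waiting for the response, so the interesting question is what the transport threads are doing during that window.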
