One thing I noticed is that you are sharing the same timespec in your service. I think you need one per RPC, in case RPCs arrive concurrently. From reading the client it does seem serial (again, correct me if I'm mistaken).
As for serialization: it's a fine line to walk. Some of the serializers are better supported than others, and it might give the wrong impression that gRPC recommends them. For example, I recently made a blog post <https://grpc.io/blog/grpc-with-json> about how to plug in JSON as the encoding. The code is more boilerplate, but it works better with existing codebases. If you are happy with the flatbuffer support, I'd be willing to take that as a signal that it's ready for use (real user feedback is more meaningful than evaluating it ourselves). Consider sending a PR to https://github.com/grpc/grpc.github.io where the documentation lives.

+Vijay for the C++ performance stuff.

On Wed, Sep 5, 2018 at 12:14 PM Amogh Akshintala <[email protected]> wrote:

> You're correct, but AFAIK the choice of measurement is irrelevant in this case, as the time intervals are relatively large (ms).
> I'd have used RDTSC, but the machine I'm on has DVFS turned on and I was too lazy to change that...
> Didn't know there was a wrapper in the std library (I'm more of a C programmer)...
> I'll look into using std::chrono::steady_clock next time...
>
> The numbers I'm reporting aren't from one RPC call, however. It's an average over several invocations of the same RPC:
>
> Sample:
> $ taskset -c 1 ./client -f ../data/video/__909tIOxbc_temp.mp4
> video file: ../data/video/__909tIOxbc_temp.mp4
> Press control-c to quit at any point
> 6 objects detected.
> This request took 247.746 milliseconds
> 6 objects detected.
> This request took 240.866 milliseconds
> 7 objects detected.
> This request took 230.565 milliseconds
> 8 objects detected.
> This request took 227.763 milliseconds
> 10 objects detected.
> This request took 229.529 milliseconds
> 10 objects detected.
> This request took 225.853 milliseconds
> 10 objects detected.
> This request took 220.809 milliseconds
> 9 objects detected.
> This request took 229.487 milliseconds
> 8 objects detected.
> This request took 237.131 milliseconds
> 7 objects detected.
> This request took 228.043 milliseconds
> 8 objects detected.
> This request took 233.57 milliseconds
> 9 objects detected.
> This request took 230.96 milliseconds
> 9 objects detected.
> This request took 226.805 milliseconds
> 10 objects detected.
> This request took 222.559 milliseconds
> 9 objects detected.
> This request took 233.108 milliseconds
> 9 objects detected.
> This request took 225.9 milliseconds
> 8 objects detected.
> This request took 227.267 milliseconds
> 9 objects detected.
> This request took 227.419 milliseconds
> 9 objects detected.
> This request took 232.574 milliseconds
> 9 objects detected.
> This request took 231.147 milliseconds
> 10 objects detected.
> This request took 230.447 milliseconds
>
> Corresponding server measurements:
> $ taskset -c 5 ./server ../cfg/coco.data ../cfg/yolov3.cfg ../weights/yolov3.weights
> Server listening on localhost:50051
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.904 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.514 milliseconds
> 0x7fff5d1dcc90Server took 66.55 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.804 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.263 milliseconds
> 0x7fff5d1dcc90Server took 66.288 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.762 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.797 milliseconds
> 0x7fff5d1dcc90Server took 66.825 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.768 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.182 milliseconds
> 0x7fff5d1dcc90Server took 66.209 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.78 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.244 milliseconds
> 0x7fff5d1dcc90Server took 66.271 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.772 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.102 milliseconds
> 0x7fff5d1dcc90Server took 66.13 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.768 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.107 milliseconds
> 0x7fff5d1dcc90Server took 66.135 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.788 milliseconds
> 0x7fff5d1dcc90 doDetection: took 71.868 milliseconds
> 0x7fff5d1dcc90Server took 72.365 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.795 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.157 milliseconds
> 0x7fff5d1dcc90Server took 66.186 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.815 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.255 milliseconds
> 0x7fff5d1dcc90Server took 66.284 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.746 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.489 milliseconds
> 0x7fff5d1dcc90Server took 66.516 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.902 milliseconds
> 0x7fff5d1dcc90 doDetection: took 68.414 milliseconds
> 0x7fff5d1dcc90Server took 68.452 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.758 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.615 milliseconds
> 0x7fff5d1dcc90Server took 66.643 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.811 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.339 milliseconds
> 0x7fff5d1dcc90Server took 66.368 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 24.714 milliseconds
> 0x7fff5d1dcc90 doDetection: took 70.671 milliseconds
> 0x7fff5d1dcc90Server took 70.704 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.805 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.269 milliseconds
> 0x7fff5d1dcc90Server took 66.299 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.836 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.73 milliseconds
> 0x7fff5d1dcc90Server took 66.755 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.791 milliseconds
> 0x7fff5d1dcc90 doDetection: took 65.679 milliseconds
> 0x7fff5d1dcc90Server took 65.707 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.826 milliseconds
> 0x7fff5d1dcc90 doDetection: took 66.242 milliseconds
> 0x7fff5d1dcc90Server took 66.271 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.804 milliseconds
> 0x7fff5d1dcc90 doDetection: took 67.864 milliseconds
> 0x7fff5d1dcc90Server took 67.893 milliseconds
> doDetection: new requeust 0x7fff5d1dcc90
> 0x7fff5d1dcc90 GPU processing took 23.807 milliseconds
> 0x7fff5d1dcc90 doDetection: took 67.705 milliseconds
> 0x7fff5d1dcc90Server took 67.736 milliseconds
>
> Re flatbuffers:
> AFAIK, flatbuffers doesn't transfer anything that is unused (fields can be marked deprecated).
> If a field isn't set to a value other than its default, it isn't transferred over the wire.
> Instead, such values are repopulated with the default value (specified in the schema) on the client/server side as needed.
> Of course, this only works for scalars. If you need to deprecate a vector, then yes, you're kinda hosed...
> Hmmm, I think the gRPC documentation should point to flatbuffers (which supports gRPC out of the box) as an alternative suited to low-latency cases (while making the drawbacks you mentioned clear).
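The scalar-default and deprecation behavior described above can be sketched in a hypothetical FlatBuffers schema (the field names are invented for illustration; the real darknetserver schema may differ):

```
table DetectionRequest {
  frame:[ubyte];              // payload vector; cannot be cheaply retired later
  width:int = 416;            // scalar with a schema default: if the writer leaves
                              // it at 416, nothing goes over the wire and the
                              // reader repopulates 416 from the schema
  old_flag:bool (deprecated); // scalar retired without breaking old binaries;
                              // accessors are removed but the slot stays reserved
}
```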
>
> Cheers,
> Amogh Akshintala
> http://aakshintala.com
>
> On Wed, Sep 05, 2018 at 2:52 PM Carl Mastrangelo <address redacted> wrote:
>
>> Correct me if I am mistaken, but gettimeofday will use the realtime
>> clock rather than the monotonic clock. The latter, I believe, is the correct
>> way to measure time differences.
>>
>> Second, if you only issue one RPC, it won't be as fast. Code won't be in
>> cache, threads won't be active, and TCP windows won't be large enough.
>>
>> Last: flatbuffers, I believe, are fixed size, which means that as you
>> update your proto with new fields, you'll still have the old fields stick
>> around. This puts you in a situation where you have to either reuse
>> fields (which is suspect from an API compatibility POV) or accept the larger
>> size (which hurts performance). You might be willing to make this trade
>> if your API will seldom change, but generally they do change. Plain
>> protobuf is slower now, for increased flexibility later. You're certainly
>> allowed to swap out the serialization based on need, but I don't think the
>> gRPC team can recommend it.
>>
>> On Wed, Sep 5, 2018 at 11:35 AM Amogh Akshintala <[email protected]> wrote:
>>
>>> Hey Carl,
>>>
>>> Thanks for your time!
>>>
>>> I see the same performance numbers as reported in the protobuf
>>> performance dashboard (~0.2ms per RPC) if I set up the HelloWorld client to
>>> call SayHello 1000 times in a tight loop and average over the 1000 calls.
>>>
>>> How I arrived at the numbers I reported (everything is measured on localhost):
>>> The links are to my client and server code on GitHub.
>>> Client RPC Stub
>>> <https://github.com/aakshintala/darknet/blob/cf1c4dfeb2a2f1c3d123bc89f90edbb37854b25d/server/client.cpp#L100>:
>>> perform any necessary data manipulation
>>> Pack Request message
>>> *start = getTimeOfDay()*
>>> stub->invokeRPC()
>>> checkStatus()
>>> *end = getTimeOfDay()*
>>> *Client RTT = end - start*
>>>
>>> In Server RPC ServiceImpl
>>> <https://github.com/aakshintala/darknet/blob/cf1c4dfeb2a2f1c3d123bc89f90edbb37854b25d/server/server.cpp#L43>:
>>> *start = getTimeOfDay()*
>>> serviceRequest() <- GPU time is calculated inside this function, but
>>> ignore that for now.
>>> *end = getTimeOfDay()*
>>> return Status
>>> *Server time = end - start*
>>>
>>> *gRPC + protobuf overhead = Client RTT - Server time*
>>>
>>> I replaced protobuf with flatbuffers yesterday, after noticing (using perf)
>>> that a significant chunk of processing time was spent in protobuf
>>> serialization and deserialization code.
>>> Latency really improved with flatbuffers (no parsing, so...),
>>> but man is that library hard to use/debug compared to protobuf...
>>>
>>> New numbers *with flatbuffers:*
>>> *Client RTT = ~70ms*
>>> *Server time = ~40ms*
>>> *gRPC + flatbuffers = ~30ms (for ~4MiB of data over localhost)*
>>>
>>> Thanks for the link to pprof. Will check it out, especially if
>>>
>>> Cheers,
>>> Amogh Akshintala
>>> http://aakshintala.com
>>>
>>> On Wed, Sep 05, 2018 at 1:59 PM "'Carl Mastrangelo' via grpc.io" wrote:
>>>
>>>> Our own benchmarks
>>>> <https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5636470266134528>
>>>> get about 1000x better latency than that, so something is definitely up.
>>>> Can you describe how you arrived at that number, or the tools you used
>>>> to profile?
>>>> (We use perf and pprof.)
>>>>
>>>> On Monday, September 3, 2018 at 10:39:45 AM UTC-7, [email protected] wrote:
>>>>>
>>>>> Hail gRPC experts (;D),
>>>>>
>>>>> I'm trying to build an image/video object detection server (as one of
>>>>> the reusable pieces in a benchmark suite) with low RTT requirements
>>>>> (near-realtime, say ~60-90ms RTT)...
>>>>> I've used gRPC and protobuf (built from git master; hashes below in
>>>>> case that is relevant) for serialization and transport.
>>>>> _________________________________
>>>>> grpc:
>>>>> commit dbc1e27e2e1a81b61eb064eb036ec6a267f88cb6
>>>>> Merge: 9bc6cd1 5d24ab9
>>>>> Author: Jiangtao Li <email redacted by me>
>>>>> Date: Fri Jul 20 17:00:18 2018 -0700
>>>>>
>>>>> protobuf:
>>>>> commit b5fbb742af122b565925987e65c08957739976a7
>>>>> Author: Bo Yang <email redacted by me>
>>>>> Date: Mon Mar 5 19:54:18 2018 -0800
>>>>> _________________________________
>>>>>
>>>>> gRPC seems to add inane amounts of overhead -- ~160ms (~2x the
>>>>> server's processing time)!
>>>>> For now I'm running on a single machine (a pretty beefy machine, so
>>>>> contention isn't an issue...) operating over localhost (loopback).
>>>>> The amount of data being transferred is considerable, but not unheard
>>>>> of (~4MiB per request).
>>>>>
>>>>> Server-side timing measurements:
>>>>> doDetection: new requeust 0x7ffc77f16920
>>>>> 0x7ffc77f16920: GPU processing took 24.045 milliseconds
>>>>> 0x7ffc77f16920: Server took *72.206 milliseconds*
>>>>>
>>>>> Client-side measurements:
>>>>> 10 objects detected.
>>>>> This request took *234.825 milliseconds*
>>>>>
>>>>> *Client RTT - Server processing time = 234.825 - 72.206 = 162.619ms (!??!)*
>>>>> I've pinned the server and client to separate cores using taskset.
>>>>> There isn't anything else running on the server, and it's a beefy 48-core
>>>>> (Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz) machine with ample RAM (128GiB), etc....
>>>>>
>>>>> As a start, I instrumented the implementation of the synchronous call
>>>>> in include/grpcpp/impl/codegen/client_unary_call.h:
>>>>> BlockingUnaryCallImpl(ChannelInterface* channel, const RpcMethod& method,
>>>>>                       ClientContext* context, const InputMessage& request,
>>>>>                       OutputMessage* result)
>>>>>
>>>>> and found that the vast majority of the time is spent spinning on a
>>>>> completion queue:
>>>>> line 107: if (cq.Pluck(&ops)) {
>>>>>
>>>>> I wonder if I need to configure gRPC differently (perhaps the default
>>>>> configurations are more geared towards latency-insensitive batching?)...
>>>>>
>>>>> Any help understanding these numbers would be appreciated.
>>>>> Server code:
>>>>> https://github.com/aakshintala/darknet/blob/master/server/server.cpp
>>>>> Client code:
>>>>> https://github.com/aakshintala/darknet/blob/master/server/client.cpp
>>>>> Proto file:
>>>>> https://github.com/aakshintala/darknet/blob/master/server/darknetserver.proto
>>>>>
>>>>> Thanks in advance,
>>>>> Amogh Akshintala
>>>>> aakshintala.com
>>>
>>> --
>>> You received this message because you are subscribed to a topic in the
>>> Google Groups "grpc.io" group.
>>> To unsubscribe from this topic, visit
>>> https://groups.google.com/d/topic/grpc-io/USjGJDmu_Hw/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to
>>> [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/grpc-io.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/grpc-io/c37ac6ed-9149-43dc-b9a3-5574e4eca439%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
