Hello, Jesper! Nice to see you outside the Erlang community too!
On Sunday, March 19, 2017 at 18:09:17 UTC+3, Jesper Louis Andersen wrote:
>
> My approach is usually this:
>
> When a problem like this occurs, I very quickly switch from random 
> guessing at what the problem can be into a mode where I try to verify the 
> mental model I have of the system. Your mental model is likely wrong, and 
> thus it is leading you astray in what the problem might be. So I start 
> devising metrics that can support the mental model I have. Often, when your 
> model is corrected, you start understanding the pathology of the system. I 
> tend to start from the bottom and work up through the layers, trying to 
> verify in each layer that I'm seeing behavior that isn't out of the 
> ordinary from the mental model I have.
>

I absolutely agree with you on that. First put forward a hypothesis, and 
then try to confirm or disprove it! The problem is, I have no more 
hypotheses!
 

>
> * At 4000 req/s, we are implicitly assuming that each request looks the 
> same. Otherwise that is a weak metric as an indicator of system behavior. 
> Are they the same and do they take the same work? If we log the slowest 
> request every 5 seconds, what does it look like compared to a typical one?
>

All requests are the same and show the same behavior! I log all requests 
and they all look similar.
 

> * The 99th percentile ignores the 40 slowest queries. What do the 99.9, 
> 99.99, ... and max percentiles look like?
>

I have no answer to this question yet, and I don't see how it would help me.
 

> * What lies between the external measurement and the internal measurement? 
> Can we inject a metric for each of those?
>

Yes, that is also the main question! I log and graph nginx 
$request_time, and I log and graph the internal function time. What lies 
between them I can't log; it's:
 - the local network (TCP);
 - work in kernel/user space;
 - golang GC and other runtime work;
 - the golang fasthttp machinery before my http handler is called.
 

> * The operating system and environment are only doing work for us, and 
> not for someone else because the host is virtualized, or some other 
> operation is running.
>

Only for us! There is no other application that could impact my 
application's performance!
 

> * There is enough bandwidth.
>

It looks like the bandwidth is sufficient; my graphs show me that. And as 
far as I know, the local network inside a single server can't affect 
application performance that much.
 

> * Caches have hit/miss rates that looks about right.
>

In my application these are not true caches; in reality they are 
dictionaries loaded from the database and used in calculations.
 

> * The cache also caches negative responses. That is, if an element is not 
> present in the backing store, a lookup in the cache will not fail on 
> repeated requests and go to the said backing store.
>

(see my answer earlier)
 

> * 15% CPU load means we are spending ample amounts of time waiting. What 
> are we waiting on?
>

Maybe, or maybe the 32 cores can simply handle the 4k rps. How can I find 
out what my app is waiting on?
 

> Start measuring foreign support systems further down the chain. Don't 
> trust your external partners. Especially if they are a network connection 
> away. What are the latencies for the waiting down the line?
>

Yes, I measure latency on my side using nginx: I log $request_time and 
then graph it.
 

> * Are we measuring the right thing in the internal measurements? If the 
> window between external/internal is narrow, then chances are we are doing 
> the wrong thing on the internal side.
>

Could you explain this?
 

>
> Google's SRE handbook mentions the 4 "golden" metrics. If nothing else, 
> measuring those on a system can often tell you if it is behaving or not.
>
> On Sun, Mar 19, 2017 at 3:47 PM Alexander Petrovsky <askj...@gmail.com> 
> wrote:
>
>> Hello, Dave!
>>
>> On Sunday, March 19, 2017 at 3:28:13 UTC+3, David Collier-Brown wrote:
>>
>>> Are you seeing the average response time / latency of the cache from 
>>> outside? 
>>>
>>
>> I don't calculate the average, I'm using percentiles! It looks like the 
>> "cache" doesn't affect things at all; otherwise I would see it on my 
>> graphs, since I call my cache inside the http handler, between the timings.
>>  
>>
>>> If so, you should see lots of really quick responses, and a few ones as 
>>> slow as inside that average to what you're seeing.
>>>
>>
>> No, as I said, I'm using only percentiles, not averages.
>>  
>>
>>>
>>> --dave
>>>
>>>
>>> On Saturday, March 18, 2017 at 3:52:21 PM UTC-4, Alexander Petrovsky 
>>> wrote:
>>>>
>>>> Hello!
>>>>
>>>> Colleagues, I need your help!
>>>>
>>>> And so, I have an application that accepts dynamic JSON over http 
>>>> (fasthttp), unmarshals it into a map[string]interface{} using ffjson, 
>>>> then reads some fields into a struct, makes some calculations using that 
>>>> struct, writes the struct fields back into a map[string]interface{}, 
>>>> writes this map to kafka (asynchronously), and finally replies the result 
>>>> to the client through http. Also, I have 2 caches, one containing 100 
>>>> million items and a second containing 20 million; these caches are built 
>>>> using freecache to avoid slooooow GC pauses. The incoming rate is 4k rps 
>>>> per server (5 servers in all), with total cpu utilisation about 15% per 
>>>> server.
>>>>
>>>> The problem — my latency measurements show that latency inside the 
>>>> application is significantly less than outside.
>>>> 1. How do I measure latency?
>>>>     - I've added timings into the http function handlers, and after that 
>>>> make graphs.
>>>> 2. How did I determine that latency inside the application is 
>>>> significantly less than outside?
>>>>     - I installed an nginx server in front of my application and log 
>>>> $request_time and $upstream_response_time, after that making graphs too.
>>>>
>>>> The graphs show me that latency inside the application is about 500 
>>>> microseconds at the 99th percentile, and about 10-15 milliseconds outside 
>>>> (nginx). Nginx and my app work on the same server. My graphs show me 
>>>> that GC occurs every 30-40 seconds and takes less than 3 milliseconds.
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-HOZJ9iwMyyw/WM2POBUU1MI/AAAAAAAABV8/jhIV1f_PBxwPbs7fSmbqg5WJfKhB-CONgCLcB/s1600/1.png>
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-Z-3-RgNcpN0/WM2PSCKXebI/AAAAAAAABWA/u-QhZs2YfzwzP6DHzu_7cT2toU-px-azACLcB/s1600/2.png>
>>>>
>>>>
>>>> Could someone help me find the problem and profile my application?
>>>>
>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to golang-nuts...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
