My approach is usually this:

When a problem like this occurs, I quickly switch from randomly guessing at
what the problem might be to a mode where I try to verify the mental model
I have of the system. Your mental model is likely wrong, and that is what
is leading you astray about what the problem might be. So I start devising
metrics that can confirm or refute the mental model I have. Often, once
your model is corrected, you start understanding the pathology of the
system. I tend to start at the bottom and work up through the layers,
verifying at each layer that the behavior I see is not out of line with my
mental model.

* At 4000 req/s, we are implicitly assuming that every request looks the
same. If they don't, that rate is a weak indicator of system behavior. Do
they all do the same amount of work? If we log the slowest request in every
5-second window, what does it look like compared to a typical one? (The
first sketch after this list does exactly that.)
* The 99th percentile ignores the 40 slowest requests out of every second's
4000. What do the 99.9th, 99.99th, ... and max percentiles look like? (The
same sketch reports those as well.)
* What lies between the external measurement and the internal measurement?
Can we inject a metric at each of those points? (See the per-stage timing
sketch after this list.)
* The operating system and environment are doing work only for us, and not
also for someone else because the machine is virtualized or because some
other job is running on it.
* There is enough bandwidth.
* Caches have hit/miss rates that look about right.
* The cache also caches negative responses. That is, if an element is not
present in the backing store, repeated lookups do not keep missing the
cache and going to said backing store every time. (The last sketch after
this list shows the pattern.)
* 15% CPU load means we are spending ample time waiting. What are we
waiting on? Start measuring the foreign support systems further down the
chain, and don't trust your external partners, especially if they are a
network connection away. What are the latencies of those downstream waits?
* Are we measuring the right thing in the internal measurements? If the
internal measurement only covers a narrow window of what the external one
sees, chances are we are timing the wrong thing on the internal side.
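
For the "slowest request" and tail-percentile bullets, a small tracker like
the one below is usually enough to get started. This is only a rough
standard-library sketch, not the poster's fasthttp code: the type, field
and window choices are my own, and Observe() would be called at the end of
the real handler rather than from the fake traffic loop in main.

// tailwatch.go: a sketch of a per-window tail-latency tracker. It keeps
// every request duration for the current 5-second window, then logs the
// slowest request plus the tail percentiles that the 99th hides.
package main

import (
    "fmt"
    "log"
    "math/rand"
    "sort"
    "sync"
    "time"
)

type tailTracker struct {
    mu      sync.Mutex
    samples []time.Duration
    slowest time.Duration
    info    string // description of the slowest request (path, body size, ...)
}

// Observe is called once per request, e.g. with time.Since(start) at the
// end of the http handler.
func (t *tailTracker) Observe(d time.Duration, info string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.samples = append(t.samples, d)
    if d > t.slowest {
        t.slowest, t.info = d, info
    }
}

// flush logs the current window and resets it.
func (t *tailTracker) flush() {
    t.mu.Lock()
    samples, slowest, info := t.samples, t.slowest, t.info
    t.samples, t.slowest, t.info = nil, 0, ""
    t.mu.Unlock()

    if len(samples) == 0 {
        return
    }
    sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
    pct := func(p float64) time.Duration {
        return samples[int(float64(len(samples)-1)*p)]
    }
    log.Printf("p99=%v p99.9=%v p99.99=%v max=%v slowest=%q",
        pct(0.99), pct(0.999), pct(0.9999), slowest, info)
}

func main() {
    t := &tailTracker{}
    go func() {
        for range time.Tick(5 * time.Second) {
            t.flush()
        }
    }()
    // Stand-in for real traffic: observe random durations forever.
    for i := 0; ; i++ {
        t.Observe(time.Duration(rand.Intn(1000))*time.Microsecond,
            fmt.Sprintf("req-%d", i))
        time.Sleep(250 * time.Microsecond)
    }
}

Sorting every window is cheap at this rate (about 20k samples per 5-second
window at 4k req/s); at much higher rates an HDR-style histogram would be
the usual substitute.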
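
For the "inject a metric for each layer" and "are we measuring the right
thing internally" bullets, a per-request stage log makes the internal timer
accountable step by step. The stage names below (read, unmarshal, compute,
kafka) are placeholders standing in for the poster's pipeline, and the
Sleeps only simulate work:

// stages.go: a sketch of per-stage timing inside the handler, so the gap
// between nginx's $request_time and the handler's own timer can be
// attributed to a concrete step.
package main

import (
    "fmt"
    "log"
    "time"
)

// stages records a named checkpoint each time Mark is called, measured
// from the previous checkpoint.
type stages struct {
    start, last time.Time
    steps       []string
}

func newStages() *stages {
    now := time.Now()
    return &stages{start: now, last: now}
}

func (s *stages) Mark(name string) {
    now := time.Now()
    s.steps = append(s.steps, fmt.Sprintf("%s=%v", name, now.Sub(s.last)))
    s.last = now
}

// Log emits one line per request showing where the time went.
func (s *stages) Log() {
    log.Printf("total=%v %v", time.Since(s.start), s.steps)
}

func main() {
    s := newStages()
    time.Sleep(200 * time.Microsecond) // stand-in for reading the body
    s.Mark("read")
    time.Sleep(300 * time.Microsecond) // stand-in for the json unmarshal
    s.Mark("unmarshal")
    time.Sleep(100 * time.Microsecond) // stand-in for cache lookup + calculations
    s.Mark("compute")
    time.Sleep(50 * time.Microsecond) // stand-in for the async kafka enqueue
    s.Mark("kafka")
    s.Log()
}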
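
And for the negative-caching bullet, the pattern looks roughly like this.
The cache and store interfaces are deliberately generic stand-ins, not
freecache's actual API, and an empty value is used as the "known missing"
sentinel (which assumes real values are never empty):

// negcache.go: a sketch of negative caching, i.e. remembering "this key is
// not in the backing store" so repeated lookups for a missing key stop
// hitting the store.
package main

import (
    "errors"
    "fmt"
)

var errNotFound = errors.New("not found")

// absent is the sentinel stored for keys known to be missing.
var absent = []byte{}

type cache interface {
    Get(key string) ([]byte, bool)
    Set(key string, value []byte)
}

type store interface {
    Get(key string) ([]byte, error)
}

func lookup(c cache, s store, key string) ([]byte, error) {
    if v, ok := c.Get(key); ok {
        if len(v) == 0 {
            return nil, errNotFound // negative hit: don't touch the store
        }
        return v, nil
    }
    v, err := s.Get(key)
    if errors.Is(err, errNotFound) {
        c.Set(key, absent) // cache the miss as well as the hit
        return nil, errNotFound
    }
    if err != nil {
        return nil, err
    }
    c.Set(key, v)
    return v, nil
}

// Trivial in-memory stand-ins so the sketch runs.
type mapCache map[string][]byte

func (m mapCache) Get(k string) ([]byte, bool) { v, ok := m[k]; return v, ok }
func (m mapCache) Set(k string, v []byte)      { m[k] = v }

type mapStore map[string][]byte

func (m mapStore) Get(k string) ([]byte, error) {
    if v, ok := m[k]; ok {
        return v, nil
    }
    return nil, errNotFound
}

func main() {
    c, s := mapCache{}, mapStore{"present": []byte("value")}
    for _, k := range []string{"present", "missing", "missing"} {
        v, err := lookup(c, s, k)
        fmt.Println(k, string(v), err) // second "missing" is served from the cache
    }
}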


Google's SRE book calls out the four "golden signals": latency, traffic,
errors, and saturation. If nothing else, measuring those four on a system
can often tell you whether it is behaving or not.
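
One minimal way to get those four signals out of a Go service is the
sketch below, assuming net/http and expvar. The names are mine, not from
the SRE book or the original application, and counting errors would still
need either a status-capturing ResponseWriter or explicit increments in
the handlers:

// goldensignals.go: latency, traffic, errors and saturation via expvar.
package main

import (
    "expvar"
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

var (
    requests  = expvar.NewInt("requests_total")   // traffic
    errs      = expvar.NewInt("errors_total")     // errors: increment where a handler fails
    latencyNs = expvar.NewInt("latency_ns_total") // latency (a sum; divide by requests_total)
    inFlight  = expvar.NewInt("in_flight")        // a crude saturation proxy
    inFlightN int64
)

// instrument wraps a handler with golden-signal bookkeeping.
func instrument(h http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        inFlight.Set(atomic.AddInt64(&inFlightN, 1))
        start := time.Now()
        defer func() {
            latencyNs.Add(time.Since(start).Nanoseconds())
            requests.Add(1)
            inFlight.Set(atomic.AddInt64(&inFlightN, -1))
        }()
        h(w, r)
    }
}

func main() {
    http.HandleFunc("/work", instrument(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    }))
    // Importing expvar also exposes the counters at /debug/vars.
    log.Fatal(http.ListenAndServe(":8080", nil))
}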

On Sun, Mar 19, 2017 at 3:47 PM Alexander Petrovsky <askju...@gmail.com>
wrote:

> Hello, Dave!
>
> On Sunday, March 19, 2017 at 3:28:13 UTC+3, David Collier-Brown wrote:
>
> Are you seeing the average response time / latency of the cache from
> outside?
>
>
> I don't calculate the average, I'm using percentiles! It looks like the
> "cache" doesn't have any effect at all, otherwise I would have seen it on
> my graphs, since I call the cache inside the http handler, between the
> timing points.
>
>
> If so, you should see lots of really quick responses, and a few slow ones
> that average out to what you're seeing.
>
>
> No, as I said, I'm using only percentiles, not averages.
>
>
>
> --dave
>
>
> On Saturday, March 18, 2017 at 3:52:21 PM UTC-4, Alexander Petrovsky wrote:
>
> Hello!
>
> Colleagues, I need your help!
>
> And so, I have an application that accepts dynamic json over http
> (fasthttp), unmarshals it into a map[string]interface{} using ffjson,
> reads some fields into a struct, makes some calculations using that
> struct, writes the struct fields back into the map[string]interface{},
> writes that map to kafka (asynchronously), and finally replies to the
> client over http. I also have 2 caches, one containing 100 million items
> and the second 20 million; these caches are built with freecache to
> avoid slooooow GC pauses. The incoming rate is 4k rps per server (5
> servers in total), and total cpu utilisation is about 15% per server.
>
> The problem: my latency measurements show that the latency inside the
> application is significantly less than outside.
> 1. How do I measure latency?
>     - I've added timings into the http handler functions, and then I
> make graphs from them.
> 2. How did I conclude that the latency inside the application is
> significantly less than outside?
>     - I've installed nginx in front of my application and log
> $request_time and $upstream_response_time, and graph those as well.
>
> The graphs show that latency inside the application is about 500
> microseconds at the 99th percentile, and about 10-15 milliseconds
> outside (nginx). nginx and my app run on the same server. My graphs also
> show that GC occurs every 30-40 seconds and takes less than 3
> milliseconds.
>
>
> <https://lh3.googleusercontent.com/-HOZJ9iwMyyw/WM2POBUU1MI/AAAAAAAABV8/jhIV1f_PBxwPbs7fSmbqg5WJfKhB-CONgCLcB/s1600/1.png>
>
>
> <https://lh3.googleusercontent.com/-Z-3-RgNcpN0/WM2PSCKXebI/AAAAAAAABWA/u-QhZs2YfzwzP6DHzu_7cT2toU-px-azACLcB/s1600/2.png>
>
>
> Could someone help me find the problem and profile my application?
>
