Hello, Jesper! Nice to see you outside the Erlang community too!
On Sunday, March 19, 2017 at 18:09:17 UTC+3, Jesper Louis Andersen wrote:
>
> My approach is usually this:
>
> When a problem like this occurs, I very quickly switch from random
> guessing at what the problem can be into a mode where I try to verify the
> mental model I have of the system. Your mental model is likely wrong, and
> thus it is leading you astray in what the problem might be. So I start
> devising metrics that can support the mental model I have. Often, when your
> model is corrected, you start understanding the pathology of the system. I
> tend to start from the bottom and work up through the layers, trying to
> verify in each layer that I'm seeing behavior that isn't out of the
> ordinary from the mental model I have.
>

I absolutely agree with you on that: first put forward a hypothesis, then try to confirm or refute it! The problem is, I have no hypotheses left!

> * At 4000 req/s, we are implicitly assuming that each request looks the
> same. Otherwise that is a weak metric as an indicator of system behavior.
> Are they the same and take the same work? If we log the slowest request
> every 5 seconds, what does it look like compared to one of the typical ones?
>

All the requests are the same and behave the same. I log all requests, and they are all similar.

> * The 99th percentile ignores the 40 slowest queries. What do the 99.9,
> 99.99, ... and max percentiles look like?
>

I have no answer to this question yet, and I don't see how it can help me.

> * What lies between the external measurement and the internal measurement?
> Can we inject a metric for each of those?
>

Yep, that is also the main question! I log and graph nginx's $request_time, and I log and graph the internal function time. What lies in between I can't log; it is:
- the local network (TCP);
- work in kernel/user space;
- the Go GC and the rest of the runtime;
- the fasthttp machinery that runs before my HTTP handler is called.
> * The operating system and environment is only doing work for us, and not
> for someone else because it is virtualized, or some other operation is
> running.
>

Only for us! There is no other application that could impact my application's performance!

> * There is enough bandwidth.
>

Bandwidth looks sufficient; my graphs show that. And as far as I know, the local network inside a single server can't affect application performance that much.

> * Caches have hit/miss rates that look about right.
>

In my application these are not true caches: in reality they are dictionaries loaded from the database and used in the calculations.

> * The cache also caches negative responses. That is, if an element is not
> present in the backing store, a lookup in the cache will not fail on
> repeated requests and go to said backing store.
>

See my answer above.

> * 15% CPU load means we are spending ample amounts of time waiting. What
> are we waiting on?
>

Maybe, or maybe 32 cores can simply handle the 4k rps. How can I find out what my app is waiting on?

> Start measuring foreign support systems further down the chain. Don't
> trust your external partners. Especially if they are a network connection
> away. What are the latencies for the waiting down the line?
>

Yep, I measure latency on my side using nginx: I log $request_time and then graph it.

> * Are we measuring the right thing in the internal measurements? If the
> window between external/internal is narrow, then chances are we are doing
> the wrong thing on the internal side.
>

Could you explain this?

> Google's SRE handbook mentions the 4 "golden" metrics. If nothing else,
> measuring those on a system can often tell you if it is behaving or not.
>
> On Sun, Mar 19, 2017 at 3:47 PM Alexander Petrovsky <askj...@gmail.com> wrote:
>
>> Hello, Dave!
>>
>> On Sunday, March 19, 2017 at 3:28:13 UTC+3, David Collier-Brown wrote:
>>
>>> Are you seeing the average response time / latency of the cache from
>>> outside?
>>>
>>
>> I don't calculate averages, I use percentiles! It looks like the "cache"
>> doesn't affect latency at all; otherwise I would see that on my graphs,
>> since I call my cache inside the HTTP handler, between the timing points.
>>
>>
>>> If so, you should see lots of really quick responses, and a few as
>>> slow as inside that average to what you're seeing.
>>>
>>
>> No, as I said, I use only percentiles, not averages.
>>
>>
>>>
>>> --dave
>>>
>>>
>>> On Saturday, March 18, 2017 at 3:52:21 PM UTC-4, Alexander Petrovsky
>>> wrote:
>>>>
>>>> Hello!
>>>>
>>>> Colleagues, I need your help!
>>>>
>>>> So, I have an application that accepts dynamic JSON over HTTP
>>>> (fasthttp) and unmarshals it into a map[string]interface{} using ffjson.
>>>> After that, some fields are read into a struct, with which I make some
>>>> calculations; then the struct fields are written back into the
>>>> map[string]interface{}, the map is written to Kafka (asynchronously),
>>>> and finally the result is sent back to the client over HTTP. I also have
>>>> two caches, one containing 100 million items and the other 20 million;
>>>> these caches are built with freecache to avoid slooooow GC pauses. The
>>>> incoming rate is 4k rps per server (5 servers in total), and total CPU
>>>> utilisation is about 15% per server.
>>>>
>>>> The problem: my latency measurements show me that the latency inside
>>>> the application is significantly lower than outside.
>>>> 1. How do I measure latency?
>>>> - I've added timings into the HTTP handler functions, and graph them.
>>>> 2. How did I conclude that the latency inside the application is
>>>> significantly lower than outside?
>>>> - I installed nginx in front of my application, log $request_time and
>>>> $upstream_response_time, and graph those too.
>>>>
>>>> The graphs show me that the latency inside the application is about
>>>> 500 microseconds at the 99th percentile, and about 10-15 milliseconds
>>>> outside (nginx). nginx and my app run on the same server. My graphs
>>>> show that GC occurs every 30-40 seconds and takes less than 3
>>>> milliseconds.
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-HOZJ9iwMyyw/WM2POBUU1MI/AAAAAAAABV8/jhIV1f_PBxwPbs7fSmbqg5WJfKhB-CONgCLcB/s1600/1.png>
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-Z-3-RgNcpN0/WM2PSCKXebI/AAAAAAAABWA/u-QhZs2YfzwzP6DHzu_7cT2toU-px-azACLcB/s1600/2.png>
>>>>
>>>>
>>>> Could someone help me find the problem and profile my application?
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.