2018-07-12 8:37 GMT+02:00 Marc Schöchlin <m...@256bit.org>:
>
> In a first step I just would like to have two simple KPIs which describe
> an average/aggregated write/read latency based on these statistics.
>
> Are there tools/other functionalities which provide this in a simple way?

It's one of the main KPIs our management software collects and visualizes:
https://croit.io

IIRC some of the other stats collectors also already collect these metrics;
at least I recall using them with Telegraf/InfluxDB. But it's also really
easy to collect yourself (I once wrote it in bash for some weird collector
for a client). The only hurdle is that you need to calculate the derivative
yourself, because the counters only expose a running average.
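To make the derivative part concrete: in "ceph daemon osd.<nr> perf dump",
the op_r_latency/op_w_latency counters are a pair of avgcount (number of
ops) and sum (total seconds), so the latency over a recent window is
delta(sum) / delta(avgcount) between two samples. A rough sketch in Python
rather than the bash version mentioned above (run it on the OSD's host;
osd.0 and the 10-second window are just placeholders):

#!/usr/bin/env python
# Sample the running op latency counters twice and take the difference
# ("the derivative") to get the average latency over the interval.
import json
import subprocess
import time

def perf_dump(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    return json.loads(out)["osd"]

def recent_latency_ms(osd_id, interval=10):
    a = perf_dump(osd_id)
    time.sleep(interval)
    b = perf_dump(osd_id)
    result = {}
    for metric in ("op_r_latency", "op_w_latency"):
        ops = b[metric]["avgcount"] - a[metric]["avgcount"]
        secs = b[metric]["sum"] - a[metric]["sum"]
        # no ops during the interval -> no meaningful latency value
        result[metric] = secs / ops * 1000.0 if ops else None
    return result

print(recent_latency_ms(0))

Run that for every OSD and you have the input for the median/slowest-OSD
indicators mentioned in my mail below.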
I have some slides from our training about these metrics:
https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf
(Not much in there, it's more of a hands-on lab.)

Paul

> Regards
> Marc
>
> On 11.07.2018 at 18:42, Paul Emmerich wrote:
>
> Hi,
>
> from experience: commit/apply_latency are not good metrics; the only good
> thing about them is that they are really easy to track.
> But we have found them to be almost completely useless in the real world.
>
> We track the op_*_latency metrics from perf dump and found them to be very
> helpful; they are just more annoying to track due to their format.
> The median OSD is a good indicator, and so is the slowest OSD.
>
> Paul
>
> 2018-07-11 17:50 GMT+02:00 Marc Schöchlin <m...@256bit.org>:
>
>> Hello ceph-users and ceph-devel list,
>>
>> we went into production with our new shiny Luminous (12.2.5) cluster.
>> This cluster runs SSD- and HDD-based OSD pools.
>>
>> To ensure the service quality of the cluster and to have a baseline for
>> client latency optimization (e.g. in the area of deep-scrub tuning),
>> we would like to have statistics about the client interaction latency
>> of our cluster.
>>
>> Which measures are suitable to get such an "aggregated by device_class"
>> average latency KPI?
>> A percentile rank would also be great (% of requests serviced in < 5 ms,
>> % of requests serviced in < 20 ms, % of requests serviced in < 50 ms, ...).
>>
>> The following command provides an overview of the commit latency of the
>> OSDs, but no average latency and no information about the device_class:
>>
>> ceph osd perf -f json-pretty
>>
>> {
>>     "osd_perf_infos": [
>>         {
>>             "id": 71,
>>             "perf_stats": {
>>                 "commit_latency_ms": 2,
>>                 "apply_latency_ms": 0
>>             }
>>         },
>>         {
>>             "id": 70,
>>             "perf_stats": {
>>                 "commit_latency_ms": 3,
>>                 "apply_latency_ms": 0
>>             }
>>
>> Device class information can be extracted from "ceph osd df -f json-pretty".
>>
>> But building averages of averages doesn't seem to be a good idea ... :-)
>>
>> It seems that I can get more detailed information using the
>> "ceph daemon osd.<nr> perf histogram dump" command.
>> This seems to deliver the percentile rank information at a good level
>> of detail (http://docs.ceph.com/docs/luminous/dev/perf_histograms/).
>>
>> My questions:
>>
>> Are there tools to analyze and aggregate these measures for a group
>> of OSDs?
>>
>> Which measures should I use as a baseline for client latency
>> optimization?
>>
>> What is the time horizon of these measures?
>>
>> I sometimes see messages like this in my log.
>> This seems to be caused by deep scrubbing. How can I find the
>> source/solution of this problem?
>> 2018-07-11 16:58:55.064497 mon.ceph-mon-s43 [INF] Cluster is now healthy
>> 2018-07-11 16:59:15.141214 mon.ceph-mon-s43 [WRN] Health check failed: 4 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:25.037707 mon.ceph-mon-s43 [WRN] Health check update: 9 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:30.038001 mon.ceph-mon-s43 [WRN] Health check update: 23 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:35.210900 mon.ceph-mon-s43 [WRN] Health check update: 27 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:45.038718 mon.ceph-mon-s43 [WRN] Health check update: 29 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:50.038955 mon.ceph-mon-s43 [WRN] Health check update: 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:55.281279 mon.ceph-mon-s43 [WRN] Health check update: 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 17:00:00.000121 mon.ceph-mon-s43 [WRN] overall HEALTH_WARN 12 slow requests are blocked > 32 sec
>> 2018-07-11 17:00:05.039677 mon.ceph-mon-s43 [WRN] Health check update: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 17:00:09.329897 mon.ceph-mon-s43 [INF] Health check cleared: REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
>> 2018-07-11 17:00:09.329919 mon.ceph-mon-s43 [INF] Cluster is now healthy
>>
>> Regards
>> Marc
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
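P.S.: for the "aggregated by device_class" KPI: once you have a latency per
OSD (e.g. from the snippet above, run on each host), grouping by device
class and taking the median and the worst OSD is straightforward. A rough
sketch; kpi_by_class and its input dict are made up for the example, and it
assumes "ceph osd tree -f json" reports a device_class per OSD node, as it
does on Luminous:

#!/usr/bin/env python
# Group per-OSD latencies (ms) by CRUSH device class and report the
# median and the slowest OSD per class.
import json
import subprocess

def device_class_by_osd():
    # "ceph osd tree -f json" lists OSDs as nodes; Luminous adds a
    # device_class field to each OSD node.
    tree = json.loads(subprocess.check_output(
        ["ceph", "osd", "tree", "-f", "json"]))
    return {n["id"]: n.get("device_class", "unknown")
            for n in tree["nodes"] if n["type"] == "osd"}

def median(values):
    vs = sorted(values)
    mid = len(vs) // 2
    return vs[mid] if len(vs) % 2 else (vs[mid - 1] + vs[mid]) / 2.0

def kpi_by_class(latency_by_osd):
    # latency_by_osd: {osd id: op_w_latency in ms}, collected elsewhere
    classes = device_class_by_osd()
    grouped = {}
    for osd_id, lat in latency_by_osd.items():
        grouped.setdefault(classes.get(osd_id, "unknown"), []).append(lat)
    return {cls: {"median_ms": median(vals), "worst_ms": max(vals)}
            for cls, vals in grouped.items()}

print(kpi_by_class({70: 3.1, 71: 2.4, 72: 11.0}))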
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com