2018-07-12 8:37 GMT+02:00 Marc Schöchlin <m...@256bit.org>:
>
> In a first step I just would like to have two simple KPIs which describe
> an average/aggregated write/read latency based on these statistics.
>
> Are there tools/other functionalities which provide this in a simple way?

It's one of the main KPIs our management software collects and visualizes:
https://croit.io

IIRC some of the other stats collectors also already collect these metrics;
at least I recall using them with Telegraf/InfluxDB. But it's also really
easy to collect yourself (I once wrote it in bash for some weird collector
for a client). The only hurdle is that you need to calculate the derivative
yourself, because the counters only expose a running average.
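To make the derivative part concrete: in "ceph daemon osd.<nr> perf dump",
the op_r_latency/op_w_latency counters are a pair of avgcount (number of
ops) and sum (total seconds), so the latency over a recent window is
delta(sum) / delta(avgcount) between two samples. A rough sketch in Python
rather than the bash version mentioned above (run it on the OSD's host;
osd.0 and the 10-second window are just placeholders):

#!/usr/bin/env python
# Sample the running op latency counters twice and take the difference
# ("the derivative") to get the average latency over the interval.
import json
import subprocess
import time

def perf_dump(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    return json.loads(out)["osd"]

def recent_latency_ms(osd_id, interval=10):
    a = perf_dump(osd_id)
    time.sleep(interval)
    b = perf_dump(osd_id)
    result = {}
    for metric in ("op_r_latency", "op_w_latency"):
        ops = b[metric]["avgcount"] - a[metric]["avgcount"]
        secs = b[metric]["sum"] - a[metric]["sum"]
        # no ops during the interval -> no meaningful latency value
        result[metric] = secs / ops * 1000.0 if ops else None
    return result

print(recent_latency_ms(0))

Run that for every OSD and you have the input for the median/slowest-OSD
indicators mentioned in my mail below.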
I have some slides from our training about these metrics:
https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf
(Not much in there, it's more of a hands-on lab.)

Paul

> Regards
> Marc
>
> On 11.07.2018 at 18:42, Paul Emmerich wrote:
>
> Hi,
>
> from experience: commit/apply_latency are not good metrics; the only good
> thing about them is that they are really easy to track.
> But we have found them to be almost completely useless in the real world.
>
> We track the op_*_latency metrics from perf dump and found them to be very
> helpful; they are just more annoying to track due to their format.
> The median OSD is a good indicator, and so is the slowest OSD.
>
> Paul
>
> 2018-07-11 17:50 GMT+02:00 Marc Schöchlin <m...@256bit.org>:
>
>> Hello ceph-users and ceph-devel list,
>>
>> we went into production with our new shiny Luminous (12.2.5) cluster.
>> This cluster runs SSD- and HDD-based OSD pools.
>>
>> To ensure the service quality of the cluster and to have a baseline for
>> client latency optimization (e.g. in the area of deep-scrub tuning),
>> we would like to have statistics about the client interaction latency
>> of our cluster.
>>
>> Which measures are suitable to get such an "aggregated by device_class"
>> average latency KPI?
>> A percentile rank would also be great (% of requests serviced in < 5 ms,
>> % of requests serviced in < 20 ms, % of requests serviced in < 50 ms, ...).
>>
>> The following command provides an overview of the commit latency of the
>> OSDs, but no average latency and no information about the device_class:
>>
>> ceph osd perf -f json-pretty
>>
>> {
>>     "osd_perf_infos": [
>>         {
>>             "id": 71,
>>             "perf_stats": {
>>                 "commit_latency_ms": 2,
>>                 "apply_latency_ms": 0
>>             }
>>         },
>>         {
>>             "id": 70,
>>             "perf_stats": {
>>                 "commit_latency_ms": 3,
>>                 "apply_latency_ms": 0
>>             }
>>
>> Device class information can be extracted from "ceph osd df -f json-pretty".
>>
>> But building averages of averages doesn't seem to be a good idea ... :-)
>>
>> It seems that I can get more detailed information using the
>> "ceph daemon osd.<nr> perf histogram dump" command.
>> This seems to deliver the percentile rank information at a good level
>> of detail (http://docs.ceph.com/docs/luminous/dev/perf_histograms/).
>>
>> My questions:
>>
>> Are there tools to analyze and aggregate these measures for a group
>> of OSDs?
>>
>> Which measures should I use as a baseline for client latency
>> optimization?
>>
>> What is the time horizon of these measures?
>>
>> I sometimes see messages like this in my log.
>> This seems to be caused by deep scrubbing. How can I find the
>> source/solution of this problem?
>> 2018-07-11 16:58:55.064497 mon.ceph-mon-s43 [INF] Cluster is now healthy
>> 2018-07-11 16:59:15.141214 mon.ceph-mon-s43 [WRN] Health check failed: 4 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:25.037707 mon.ceph-mon-s43 [WRN] Health check update: 9 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:30.038001 mon.ceph-mon-s43 [WRN] Health check update: 23 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:35.210900 mon.ceph-mon-s43 [WRN] Health check update: 27 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:45.038718 mon.ceph-mon-s43 [WRN] Health check update: 29 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:50.038955 mon.ceph-mon-s43 [WRN] Health check update: 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 16:59:55.281279 mon.ceph-mon-s43 [WRN] Health check update: 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 17:00:00.000121 mon.ceph-mon-s43 [WRN] overall HEALTH_WARN 12 slow requests are blocked > 32 sec
>> 2018-07-11 17:00:05.039677 mon.ceph-mon-s43 [WRN] Health check update: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-07-11 17:00:09.329897 mon.ceph-mon-s43 [INF] Health check cleared: REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
>> 2018-07-11 17:00:09.329919 mon.ceph-mon-s43 [INF] Cluster is now healthy
>>
>> Regards
>> Marc
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
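P.S.: for the "aggregated by device_class" KPI: once you have a latency per
OSD (e.g. from the snippet above, run on each host), grouping by device
class and taking the median and the worst OSD is straightforward. A rough
sketch; kpi_by_class and its input dict are made up for the example, and it
assumes "ceph osd tree -f json" reports a device_class per OSD node, as it
does on Luminous:

#!/usr/bin/env python
# Group per-OSD latencies (ms) by CRUSH device class and report the
# median and the slowest OSD per class.
import json
import subprocess

def device_class_by_osd():
    # "ceph osd tree -f json" lists OSDs as nodes; Luminous adds a
    # device_class field to each OSD node.
    tree = json.loads(subprocess.check_output(
        ["ceph", "osd", "tree", "-f", "json"]))
    return {n["id"]: n.get("device_class", "unknown")
            for n in tree["nodes"] if n["type"] == "osd"}

def median(values):
    vs = sorted(values)
    mid = len(vs) // 2
    return vs[mid] if len(vs) % 2 else (vs[mid - 1] + vs[mid]) / 2.0

def kpi_by_class(latency_by_osd):
    # latency_by_osd: {osd id: op_w_latency in ms}, collected elsewhere
    classes = device_class_by_osd()
    grouped = {}
    for osd_id, lat in latency_by_osd.items():
        grouped.setdefault(classes.get(osd_id, "unknown"), []).append(lat)
    return {cls: {"median_ms": median(vals), "worst_ms": max(vals)}
            for cls, vals in grouped.items()}

print(kpi_by_class({70: 3.1, 71: 2.4, 72: 11.0}))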
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com