Bad SSDs can also cause this. Which SSD are you using?

Paul
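PS: If you want to rule the drives out yourself, a quick check is the model/firmware and a 4k O_DSYNC write test. This is only a sketch: /dev/sdX is a placeholder, and the fio run writes to the device, so point it at a spare or unused disk only, never at a live OSD:

  # identify the drive model and firmware behind a suspect OSD
  smartctl -a /dev/sdX | grep -Ei 'model|firmware'

  # 4k sync-write test; consumer SSDs without power-loss protection often
  # drop to a few hundred IOPS here, which is enough to cause slow requests
  fio --name=sync-write-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 \
      --time_based --group_reporting

Compare the result against a known-good datacenter SSD rather than against datasheet numbers.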
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Feb 22, 2019 at 2:53 PM Massimo Sgaravatto <massimo.sgarava...@gmail.com> wrote:
>
> A couple of hints for debugging the issue (since I recently had to debug a
> problem with the same symptoms):
>
> - As far as I understand, the reported 'implicated osds' are only the
> primary ones. In the logs of the OSDs you should also find the relevant pg
> number, and with that information you can get all the involved OSDs. This
> can be useful, e.g., to see whether a specific OSD node is always involved.
> That was my case (the problem was with the patch cable connecting the node).
>
> - You can use the "ceph daemon osd.x dump_historic_ops" command to debug
> some of these slow requests (to see which events take the most time).
>
> Cheers, Massimo
>
> On Fri, Feb 22, 2019 at 10:28 AM mart.v <mar...@seznam.cz> wrote:
>>
>> Hello everyone,
>>
>> I'm experiencing some strange behaviour. My cluster is relatively small
>> (43 OSDs, 11 nodes), running Ceph 12.2.10 (and Proxmox 5). Nodes are
>> connected via a 10 Gbit network (Nexus 6000). The cluster is mixed (SSD
>> and HDD), but with separate pools; the error described below occurs only
>> on the SSD part of the cluster.
>>
>> I noticed that a few times a day the cluster slows down a bit, and I
>> discovered this in the logs:
>>
>> 2019-02-22 08:21:20.064396 mon.node1 mon.0 172.16.254.101:6789/0 1794159 :
>> cluster [WRN] Health check failed: 27 slow requests are blocked > 32 sec.
>> Implicated osds 10,22,33 (REQUEST_SLOW)
>> 2019-02-22 08:21:26.589202 mon.node1 mon.0 172.16.254.101:6789/0 1794169 :
>> cluster [WRN] Health check update: 199 slow requests are blocked > 32 sec.
>> Implicated osds 0,4,5,6,7,8,9,10,12,16,17,19,20,21,22,25,26,33,41
>> (REQUEST_SLOW)
>> 2019-02-22 08:21:32.655671 mon.node1 mon.0 172.16.254.101:6789/0 1794183 :
>> cluster [WRN] Health check update: 448 slow requests are blocked > 32 sec.
>> Implicated osds 0,3,4,5,6,7,8,9,10,12,15,16,17,19,20,21,22,24,25,26,33,41
>> (REQUEST_SLOW)
>> 2019-02-22 08:21:38.744210 mon.node1 mon.0 172.16.254.101:6789/0 1794210 :
>> cluster [WRN] Health check update: 388 slow requests are blocked > 32 sec.
>> Implicated osds 4,8,10,16,24,33 (REQUEST_SLOW)
>> 2019-02-22 08:21:42.790346 mon.node1 mon.0 172.16.254.101:6789/0 1794214 :
>> cluster [INF] Health check cleared: REQUEST_SLOW (was: 18 slow requests
>> are blocked > 32 sec. Implicated osds 8,16)
>>
>> "ceph health detail" shows nothing more.
>>
>> It happens throughout the day, and the occurrences can't be linked to any
>> read- or write-intensive task (e.g. backup). I also tried disabling
>> scrubbing, but the warnings kept appearing. These errors were not there
>> from the beginning, but unfortunately I cannot pinpoint the day they
>> started (it is beyond my logs).
>>
>> Any ideas?
>>
>> Thank you!
>> Martin
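To turn Massimo's two hints above into commands, something along these lines should work. The pg id (2.1f), OSD id (osd.10), and log path are placeholders, not values from this cluster, so adjust them to what your logs actually report:

  # list all OSDs (not just the primary) in the acting set of a given pg
  ceph pg map 2.1f

  # on the node hosting a suspect OSD: dump the slowest recent ops and
  # check the per-event timestamps to see where the time is spent
  ceph daemon osd.10 dump_historic_ops

  # rough tally of which OSDs appear most often in REQUEST_SLOW warnings
  # (run on a mon node; assumes the default cluster log location)
  grep REQUEST_SLOW /var/log/ceph/ceph.log \
    | grep -o 'Implicated osds [0-9,]*' \
    | cut -d' ' -f3 | tr ',' '\n' | sort -n | uniq -c | sort -rn

If the OSDs of one node dominate the tally, that points to the kind of per-node problem (NIC, cable, controller) Massimo describes.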