Bad SSDs can also cause this. Which SSD are you using?

Paul
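PS: If you want to rule the drives out yourself, a quick check is the model/firmware and a 4k O_DSYNC write test. This is only a sketch: /dev/sdX is a placeholder, and the fio run writes to the device, so point it at a spare or unused disk only, never at a live OSD:

  # identify the drive model and firmware behind a suspect OSD
  smartctl -a /dev/sdX | grep -Ei 'model|firmware'

  # 4k sync-write test; consumer SSDs without power-loss protection often
  # drop to a few hundred IOPS here, which is enough to cause slow requests
  fio --name=sync-write-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 \
      --time_based --group_reporting

Compare the result against a known-good datacenter SSD rather than against datasheet numbers.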
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Feb 22, 2019 at 2:53 PM Massimo Sgaravatto <massimo.sgarava...@gmail.com> wrote:
>
> A couple of hints for debugging the issue (since I recently had to debug a
> problem with the same symptoms):
>
> - As far as I understand, the reported 'implicated osds' are only the
> primary ones. In the logs of the OSDs you should also find the relevant pg
> number, and with that information you can get all the involved OSDs. This
> can be useful, e.g., to see whether a specific OSD node is always involved.
> That was my case (the problem was with the patch cable connecting the node).
>
> - You can use the "ceph daemon osd.x dump_historic_ops" command to debug
> some of these slow requests (to see which events take the most time).
>
> Cheers, Massimo
>
> On Fri, Feb 22, 2019 at 10:28 AM mart.v <mar...@seznam.cz> wrote:
>>
>> Hello everyone,
>>
>> I'm experiencing some strange behaviour. My cluster is relatively small
>> (43 OSDs, 11 nodes), running Ceph 12.2.10 (and Proxmox 5). Nodes are
>> connected via a 10 Gbit network (Nexus 6000). The cluster is mixed (SSD
>> and HDD), but with separate pools; the error described below occurs only
>> on the SSD part of the cluster.
>>
>> I noticed that a few times a day the cluster slows down a bit, and I
>> discovered this in the logs:
>>
>> 2019-02-22 08:21:20.064396 mon.node1 mon.0 172.16.254.101:6789/0 1794159 :
>> cluster [WRN] Health check failed: 27 slow requests are blocked > 32 sec.
>> Implicated osds 10,22,33 (REQUEST_SLOW)
>> 2019-02-22 08:21:26.589202 mon.node1 mon.0 172.16.254.101:6789/0 1794169 :
>> cluster [WRN] Health check update: 199 slow requests are blocked > 32 sec.
>> Implicated osds 0,4,5,6,7,8,9,10,12,16,17,19,20,21,22,25,26,33,41
>> (REQUEST_SLOW)
>> 2019-02-22 08:21:32.655671 mon.node1 mon.0 172.16.254.101:6789/0 1794183 :
>> cluster [WRN] Health check update: 448 slow requests are blocked > 32 sec.
>> Implicated osds 0,3,4,5,6,7,8,9,10,12,15,16,17,19,20,21,22,24,25,26,33,41
>> (REQUEST_SLOW)
>> 2019-02-22 08:21:38.744210 mon.node1 mon.0 172.16.254.101:6789/0 1794210 :
>> cluster [WRN] Health check update: 388 slow requests are blocked > 32 sec.
>> Implicated osds 4,8,10,16,24,33 (REQUEST_SLOW)
>> 2019-02-22 08:21:42.790346 mon.node1 mon.0 172.16.254.101:6789/0 1794214 :
>> cluster [INF] Health check cleared: REQUEST_SLOW (was: 18 slow requests
>> are blocked > 32 sec. Implicated osds 8,16)
>>
>> "ceph health detail" shows nothing more.
>>
>> It happens throughout the day, and the occurrences can't be linked to any
>> read- or write-intensive task (e.g. backup). I also tried disabling
>> scrubbing, but the warnings kept appearing. These errors were not there
>> from the beginning, but unfortunately I cannot pinpoint the day they
>> started (it is beyond my logs).
>>
>> Any ideas?
>>
>> Thank you!
>> Martin
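To turn Massimo's two hints above into commands, something along these lines should work. The pg id (2.1f), OSD id (osd.10), and log path are placeholders, not values from this cluster, so adjust them to what your logs actually report:

  # list all OSDs (not just the primary) in the acting set of a given pg
  ceph pg map 2.1f

  # on the node hosting a suspect OSD: dump the slowest recent ops and
  # check the per-event timestamps to see where the time is spent
  ceph daemon osd.10 dump_historic_ops

  # rough tally of which OSDs appear most often in REQUEST_SLOW warnings
  # (run on a mon node; assumes the default cluster log location)
  grep REQUEST_SLOW /var/log/ceph/ceph.log \
    | grep -o 'Implicated osds [0-9,]*' \
    | cut -d' ' -f3 | tr ',' '\n' | sort -n | uniq -c | sort -rn

If the OSDs of one node dominate the tally, that points to the kind of per-node problem (NIC, cable, controller) Massimo describes.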