A couple of hints to debug the issue (since I recently had to debug a
problem with the same symptoms):

- As far as I understand, the reported 'implicated osds' are only the
primary ones. In the logs of those OSDs you should also find the relevant pg
numbers, and with that information you can get all of the involved OSDs. This
can be useful e.g. to see whether a specific OSD node is always involved. That
was my case (the problem was with the patch cable connecting the node). See
the example commands after this list.

- You can use the "ceph daemon osd.x dump_historic_ops" command to debug
some of these slow requests (to see which events take the most time); see
the sketch after this list.
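
For the first hint, a minimal example of going from a slow-request log entry
to the full set of involved OSDs (osd.10, the log path and the pg id 2.1f
below are just placeholders; substitute the ones from your own logs):

    # on the node hosting one of the implicated OSDs, find the pg id
    # embedded in the osd_op of the slow request
    grep 'slow request' /var/log/ceph/ceph-osd.10.log

    # map that pg to its up/acting OSD set; output looks roughly like:
    # osdmap eNNNN pg 2.1f (2.1f) -> up [10,22,33] acting [10,22,33]
    ceph pg map 2.1f

    # locate an OSD's host, to spot a node that keeps showing up
    ceph osd find 22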
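
For the second hint, a rough sketch (run it on the node hosting the OSD,
since it goes through the local admin socket; jq is optional and the exact
JSON layout can differ between releases, so adjust the filter if needed):

    # dump the recent slowest ops recorded by osd.10
    ceph daemon osd.10 dump_historic_ops > historic_ops.json

    # show the five longest ops with their per-event timeline
    jq '.ops | sort_by(.duration) | reverse | .[0:5]
        | .[] | {description, duration, events: .type_data.events}' \
        historic_ops.json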

Cheers, Massimo

On Fri, Feb 22, 2019 at 10:28 AM mart.v <mar...@seznam.cz> wrote:

> Hello everyone,
>
> I'm experiencing some strange behaviour. My cluster is relatively small (43
> OSDs, 11 nodes), running Ceph 12.2.10 (and Proxmox 5). Nodes are connected
> via a 10 Gbit network (Nexus 6000). The cluster is mixed (SSD and HDD), but
> with different pools. The described error occurs only on the SSD part of
> the cluster.
>
> I noticed that a few times a day the cluster slows down a bit, and I
> discovered this in the logs:
>
> 2019-02-22 08:21:20.064396 mon.node1 mon.0 172.16.254.101:6789/0 1794159
> : cluster [WRN] Health check failed: 27 slow requests are blocked > 32 sec.
> Implicated osds 10,22,33 (REQUEST_SLOW)
> 2019-02-22 08:21:26.589202 mon.node1 mon.0 172.16.254.101:6789/0 1794169
> : cluster [WRN] Health check update: 199 slow requests are blocked > 32
> sec. Implicated osds 0,4,5,6,7,8,9,10,12,16,17,19,20,21,22,25,26,33,41
> (REQUEST_SLOW)
> 2019-02-22 08:21:32.655671 mon.node1 mon.0 172.16.254.101:6789/0 1794183
> : cluster [WRN] Health check update: 448 slow requests are blocked > 32
> sec. Implicated osds
> 0,3,4,5,6,7,8,9,10,12,15,16,17,19,20,21,22,24,25,26,33,41 (REQUEST_SLOW)
> 2019-02-22 08:21:38.744210 mon.node1 mon.0 172.16.254.101:6789/0 1794210
> : cluster [WRN] Health check update: 388 slow requests are blocked > 32
> sec. Implicated osds 4,8,10,16,24,33 (REQUEST_SLOW)
> 2019-02-22 08:21:42.790346 mon.node1 mon.0 172.16.254.101:6789/0 1794214
> : cluster [INF] Health check cleared: REQUEST_SLOW (was: 18 slow requests
> are blocked > 32 sec. Implicated osds 8,16)
>
> "ceph health detail" shows nothing more
>
> It happens throughout the whole day, and the times can't be linked to any
> read- or write-intensive task (e.g. backup). I also tried disabling
> scrubbing, but it kept on going. These errors were not there from the
> beginning, but unfortunately I cannot pinpoint the day they started (it is
> beyond my log retention).
>
> Any ideas?
>
> Thank you!
> Martin
