On Mon, Feb 25, 2019 at 9:26 PM mart.v <mar...@seznam.cz> wrote: > - As far as I understand the reported 'implicated osds' are only the > primary ones. In the log of the osds you should find also the relevant pg > number, and with this information you can get all the involved OSDs. This > might be useful e.g. to see if a specific OSD node is always involved. This > was my case (a the problem was with the patch cable connecting the node) > > > I can see right from the REQUEST_SLOW error log lines implicated OSDs and > therefore I can tell which nodes are involved. It is indeed on all nodes in > a cluster, no exception. So it cannot be linked to one specific node. >
I am afraid I was not clear enough. Suppose that ceph health detail reports a slow request involving osd.14 In osd.14 log I see this line: 2019-02-24 16:58:39.475740 7fe25a84d700 0 log_channel(cluster) log [WRN] : slow request 30.328572 seconds old, received at 2019-02-24 16:58:09.147037: osd_op(client.148580771.0:476351313 8.1d6 8:6ba6a916:::rbd_data.ba32e7238e1f29.00000000000004b3:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3776512~4096] snapc 0=[] ondisk+write+known_if_redirected e1242718) currently op_applied Here the pg_num is 8.1d6 # ceph pg map 8.1d6 osdmap e1247126 pg 8.1d6 (8.1d6) -> up [14,38,24] acting [14,38,24] [root@ceph-osd-02 ceph]# ceph pg map 8.1d6 So the problem is not necessarily is osd.14. It could also in osd.38 or osd.24, or in the relevant hosts > > > - You can use the "ceph daemon osd.x dump_historic_ops" command to debug > some of these slow requests (to see which events take much time) > > > 2019-02-25 17:40:49.550303 > initiated > > 2019-02-25 17:40:49.550338 > queued_for_pg > > 2019-02-25 17:40:49.550924 > reached_pg > > 2019-02-25 17:40:49.550950 > started > > 2019-02-25 17:40:49.550989 > waiting for subops from 21,35 > > 2019-02-25 17:40:49.552316 > op_commit > > 2019-02-25 17:40:49.552320 > op_applied > > 2019-02-25 17:40:49.553216 > sub_op_commit_rec from 21 > > 2019-02-25 17:41:18.416662 > sub_op_commit_rec from 35 > > 2019-02-25 17:41:18.416708 > commit_sent > > 2019-02-25 17:41:18.416726 > done > > > I'm not sure how to read this output - the time is start or finish? Does > it mean that it is waiting for OSD 21 or 35? I tried to examine few > different OSDs for dump_historic_ops, they all seems to wait on other OSDs. > But there is no similarity (OSD numbers are different). > > > As far as I understand In this case most of the time was waiting for an answer from osd.35 PS: You might also want to have a look at the thread "Debugging 'slow requests'" in this mailing list where Brad Hubbard (thanks again !) helped me debugging a 'slow request' problem Cheers, Massimo Best, > > Martin > >
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com