Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time

Massimo Sgaravatto Tue, 26 Feb 2019 00:56:10 -0800

On Mon, Feb 25, 2019 at 9:26 PM mart.v <mar...@seznam.cz> wrote:

> - As far as I understand the reported 'implicated osds' are only the
> primary ones. In the log of the osds you should find also the relevant pg
> number, and with this information you can get all the involved OSDs. This
> might be useful e.g. to see if a specific OSD node is always involved. This
> was my case (a the problem was with the patch cable connecting the node)
>
>
> I can see right from the REQUEST_SLOW error log lines implicated OSDs and
> therefore I can tell which nodes are involved. It is indeed on all nodes in
> a cluster, no exception. So it cannot be linked to one specific node.
>


I am afraid I was not clear enough. Suppose that ceph health detail reports
a slow request involving osd.14

In osd.14 log I see this line:

2019-02-24 16:58:39.475740 7fe25a84d700  0 log_channel(cluster) log [WRN] :
slow request 30.328572 seconds old, received at 2019-02-24 16:58:09.147037:
osd_op(client.148580771.0:476351313 8.1d6
8:6ba6a916:::rbd_data.ba32e7238e1f29.00000000000004b3:head [set-alloc-hint
object_size 4194304 write_size 4194304,write 3776512~4096] snapc 0=[]
ondisk+write+known_if_redirected e1242718) currently op_applied

Here the pg_num is 8.1d6

# ceph pg map 8.1d6
osdmap e1247126 pg 8.1d6 (8.1d6) -> up [14,38,24] acting [14,38,24]
[root@ceph-osd-02 ceph]# ceph pg map 8.1d6

So the problem is not necessarily is osd.14. It could also in osd.38 or
osd.24, or in the relevant hosts


>
>
> - You can use the "ceph daemon osd.x dump_historic_ops" command to debug
> some of these slow requests (to see which events take much time)
>
>
> 2019-02-25 17:40:49.550303 > initiated
>
> 2019-02-25 17:40:49.550338 > queued_for_pg
>
> 2019-02-25 17:40:49.550924 > reached_pg
>
> 2019-02-25 17:40:49.550950 > started
>
> 2019-02-25 17:40:49.550989 > waiting for subops from 21,35
>
> 2019-02-25 17:40:49.552316 > op_commit
>
> 2019-02-25 17:40:49.552320 > op_applied
>
> 2019-02-25 17:40:49.553216 > sub_op_commit_rec from 21
>
> 2019-02-25 17:41:18.416662 > sub_op_commit_rec from 35
>
> 2019-02-25 17:41:18.416708 > commit_sent
>
> 2019-02-25 17:41:18.416726 > done
>
>
> I'm not sure how to read this output  - the time is start or finish? Does
> it mean that it is waiting for OSD 21 or 35? I tried to examine few
> different OSDs for dump_historic_ops, they all seems to wait on other OSDs.
> But there is no similarity (OSD numbers are different).
>
>
> As far as I understand In this case most of the time was waiting for an
answer from osd.35

PS: You might also want to have a look at the thread "Debugging 'slow
requests'" in this mailing list where Brad Hubbard (thanks again !) helped
me debugging a 'slow request' problem

Cheers, Massimo



Best,
>
> Martin
>
>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time

Reply via email to