Regarding your original issue, you may want to configure kdump on one of the machines to get more insight into what is happening when the box hangs/crashes.
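In case it helps, this is roughly the setup on CentOS 7 - treat it as a sketch rather than a recipe, and adjust the crashkernel reservation for your hardware:

    # Install the capture tooling and make sure the crash kernel reservation
    # is on the kernel command line (CentOS 7 normally ships crashkernel=auto
    # already -- check /proc/cmdline; if it is missing, add it to
    # GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate grub.cfg).
    yum install -y kexec-tools
    grep -o 'crashkernel=[^ ]*' /proc/cmdline

    # Arm the service (it needs the reservation present in the running kernel).
    systemctl enable kdump
    systemctl start kdump

    # Verify it is armed (1 = capture kernel loaded). On a panic you should
    # then find a vmcore under /var/crash (the path is set in /etc/kdump.conf).
    cat /sys/kernel/kexec_crash_loaded

kdump also drops a vmcore-dmesg.txt next to the vmcore, which is usually enough to see whether the backtrace matches the lkml traces linked below.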
I faced a similar issue when trying 4.4.8 on my Infernalis cluster (box hangs, black screen, OSD down and out), and as it happens, there were cases with similar traces [0][1]. I didn't have time at that moment to run more tests, so I went back to using the stock 3.10 kernel. Also note that the default kdump behaviour on kernel panic is to capture a crash dump and then restart the server.

[0] https://lkml.org/lkml/2016/3/17/570
[1] https://lkml.org/lkml/2016/5/17/136

On Mon, Jun 20, 2016 at 4:12 AM, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:
> Hi All,
>  We have a Jewel (10.2.1) cluster on Centos 7 - I am using an elrepo
> 4.4.1 kernel on all machines and we have an issue where some of the
> machines hang - not sure if it's hardware or OS, but essentially the
> host, including the console, is unresponsive and can only be recovered
> with a hardware reset. Unfortunately nothing useful is logged, so I am
> still trying to figure out what is causing this. But the result for
> ceph is that if an OSD host goes down like this, we have run into an
> issue where only some of its OSDs are marked down. In the instance on
> the weekend, the host had 8 OSDs and only 5 got marked as down - this
> led to the kRBD devices jamming up trying to send IO to non-responsive
> OSDs that stayed marked up.
>
> The machine went into a slow death - lots of reports of slow or blocked
> requests:
>
> 2016-06-19 09:37:49.070810 osd.36 10.145.2.15:6802/31359 65 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.297258 secs
> 2016-06-19 09:37:54.071542 osd.36 10.145.2.15:6802/31359 82 : cluster [WRN] 112 slow requests, 5 included below; oldest blocked for > 35.297988 secs
> 2016-06-19 09:37:54.071737 osd.6 10.145.2.15:6801/21836 221 : cluster [WRN] 253 slow requests, 5 included below; oldest blocked for > 35.325155 secs
> 2016-06-19 09:37:59.072570 osd.6 10.145.2.15:6801/21836 251 : cluster [WRN] 262 slow requests, 5 included below; oldest blocked for > 40.325986 secs
>
> And then when the monitors did report them down, the OSDs disputed that:
>
> 2016-06-19 09:38:35.821716 mon.0 10.145.2.13:6789/0 244970 : cluster [INF] osd.6 10.145.2.15:6801/21836 failed (2 reporters from different host after 20.000365 >= grace 20.000000)
> 2016-06-19 09:38:36.950556 mon.0 10.145.2.13:6789/0 244978 : cluster [INF] osd.22 10.145.2.15:6806/21826 failed (2 reporters from different host after 21.613336 >= grace 20.000000)
> 2016-06-19 09:38:36.951133 mon.0 10.145.2.13:6789/0 244980 : cluster [INF] osd.31 10.145.2.15:6812/21838 failed (2 reporters from different host after 21.613781 >= grace 20.836511)
> 2016-06-19 09:38:36.951636 mon.0 10.145.2.13:6789/0 244982 : cluster [INF] osd.36 10.145.2.15:6802/31359 failed (2 reporters from different host after 21.614259 >= grace 20.000000)
>
> 2016-06-19 09:38:37.156088 osd.36 10.145.2.15:6802/31359 346 : cluster [WRN] map e28730 wrongly marked me down
> 2016-06-19 09:38:36.002076 osd.6 10.145.2.15:6801/21836 473 : cluster [WRN] map e28729 wrongly marked me down
> 2016-06-19 09:38:37.046885 osd.22 10.145.2.15:6806/21826 374 : cluster [WRN] map e28730 wrongly marked me down
> 2016-06-19 09:38:37.050635 osd.31 10.145.2.15:6812/21838 351 : cluster [WRN] map e28730 wrongly marked me down
>
> But shortly after:
>
> 2016-06-19 09:43:39.940985 mon.0 10.145.2.13:6789/0 245305 : cluster [INF] osd.6 out (down for 303.951251)
> 2016-06-19 09:43:39.941061 mon.0 10.145.2.13:6789/0 245306 : cluster [INF] osd.22 out (down for 302.908528)
> 2016-06-19 09:43:39.941099 mon.0 10.145.2.13:6789/0 245307 : cluster [INF] osd.31 out (down for 302.908527)
> 2016-06-19 09:43:39.941152 mon.0 10.145.2.13:6789/0 245308 : cluster [INF] osd.36 out (down for 302.908527)
>
> 2016-06-19 10:09:10.648924 mon.0 10.145.2.13:6789/0 247076 : cluster [INF] osd.23 10.145.2.15:6814/21852 failed (2 reporters from different host after 20.000378 >= grace 20.000000)
> 2016-06-19 10:09:10.887220 osd.23 10.145.2.15:6814/21852 176 : cluster [WRN] map e28848 wrongly marked me down
> 2016-06-19 10:14:15.160513 mon.0 10.145.2.13:6789/0 247422 : cluster [INF] osd.23 out (down for 304.288018)
>
> By the time the issue was eventually escalated and I was able to do
> something about it, I manually marked the remaining OSDs on that host
> down (which seemed to unclog RBD):
>
> 2016-06-19 15:25:06.171395 mon.0 10.145.2.13:6789/0 267212 : cluster [INF] osd.7 10.145.2.15:6808/21837 failed (2 reporters from different host after 22.000367 >= grace 20.000000)
> 2016-06-19 15:25:06.171905 mon.0 10.145.2.13:6789/0 267214 : cluster [INF] osd.24 10.145.2.15:6800/21813 failed (2 reporters from different host after 22.000748 >= grace 20.710981)
> 2016-06-19 15:25:06.172426 mon.0 10.145.2.13:6789/0 267216 : cluster [INF] osd.37 10.145.2.15:6810/31936 failed (2 reporters from different host after 22.001167 >= grace 20.000000)
>
> The question I have is why these 3 OSDs, despite not being responsive
> for over 5 hours, stayed in the cluster. The CRUSH map for all pools
> uses hosts as fault boundaries, so I would have expected OSDs on other
> hosts to notice these as unresponsive and report them. Nothing was
> logged in the OSD logs in the hour prior to the failure, and the other
> OSDs seem to have noticed the rest of that host's OSDs timing out, but
> they appeared to still be actively attempting backfills to the 3 that
> stayed up.
>
> Any ideas on how I can improve detection of this condition?
>
> Cheers,
>  Adrian
>
>
> Confidentiality: This email and any attachments are confidential and may
> be subject to copyright, legal or some other professional privilege. They
> are intended solely for the attention and use of the named addressee(s).
> They may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach
> of the confidentiality clause, please notify the sender immediately by
> return email and delete or destroy all copies of the email. Any
> confidentiality, privilege or copyright is not waived or lost because
> this email has been sent to you by mistake.
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
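On the detection question: when this happens again you can at least force the stuck OSDs out of the data path by hand, and it is worth checking what the monitors are actually using for failure reporting before tuning anything. A rough sketch (the OSD ids are the ones from your log; the mon id being the short hostname is an assumption on my part, adjust to match your /var/run/ceph/ceph-mon.*.asok):

    # Manually mark the hung daemons down. A healthy OSD will simply
    # re-assert itself as up; a wedged one stays down, which is what
    # unclogged kRBD for you.
    for id in 7 24 37; do ceph osd down $id; done

    # Inspect the failure-detection settings on a monitor.
    ceph daemon mon.$(hostname -s) config get mon_osd_min_down_reporters
    ceph daemon mon.$(hostname -s) config get mon_osd_reporter_subtree_level
    ceph daemon mon.$(hostname -s) config get osd_heartbeat_grace
    ceph daemon mon.$(hostname -s) config get mon_osd_adjust_heartbeat_grace

The varying grace values in your log (20.836511, 20.710981) look like the effect of mon_osd_adjust_heartbeat_grace scaling the grace period, so tuning there mostly changes how quickly failures are reported, not whether peers report them at all - why the 3 OSDs were never reported is the part I can't explain either.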
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com