Regarding your original issue, you may want to configure kdump on one of the machines to get more insight into what is happening when the box hangs/crashes.
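In case it helps, this is roughly the setup on CentOS 7 - treat it as a sketch rather than a recipe, and adjust the crashkernel reservation for your hardware:

    # Install the capture tooling and make sure the crash kernel reservation
    # is on the kernel command line (CentOS 7 normally ships crashkernel=auto
    # already -- check /proc/cmdline; if it is missing, add it to
    # GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate grub.cfg).
    yum install -y kexec-tools
    grep -o 'crashkernel=[^ ]*' /proc/cmdline

    # Arm the service (it needs the reservation present in the running kernel).
    systemctl enable kdump
    systemctl start kdump

    # Verify it is armed (1 = capture kernel loaded). On a panic you should
    # then find a vmcore under /var/crash (the path is set in /etc/kdump.conf).
    cat /sys/kernel/kexec_crash_loaded

kdump also drops a vmcore-dmesg.txt next to the vmcore, which is usually enough to see whether the backtrace matches the lkml traces linked below.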
I faced a similar issue when trying 4.4.8 on my Infernalis cluster (box hangs, black screen, OSD down and out), and as it happens, there were cases with similar traces [0][1]. I didn't have time at that moment to run more tests, so I went back to using the stock 3.10 kernel. Also note that the default kdump behaviour on kernel panic is to capture a crash dump and then restart the server.

[0] https://lkml.org/lkml/2016/3/17/570
[1] https://lkml.org/lkml/2016/5/17/136

On Mon, Jun 20, 2016 at 4:12 AM, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:
> Hi All,
>  We have a Jewel (10.2.1) cluster on Centos 7 - I am using an elrepo
> 4.4.1 kernel on all machines and we have an issue where some of the
> machines hang - not sure if it's hardware or OS, but essentially the
> host, including the console, is unresponsive and can only be recovered
> with a hardware reset. Unfortunately nothing useful is logged, so I am
> still trying to figure out what is causing this. But the result for
> ceph is that if an OSD host goes down like this, we have run into an
> issue where only some of its OSDs are marked down. In the instance on
> the weekend, the host had 8 OSDs and only 5 got marked as down - this
> led to the kRBD devices jamming up trying to send IO to non-responsive
> OSDs that stayed marked up.
>
> The machine went into a slow death - lots of reports of slow or blocked
> requests:
>
> 2016-06-19 09:37:49.070810 osd.36 10.145.2.15:6802/31359 65 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.297258 secs
> 2016-06-19 09:37:54.071542 osd.36 10.145.2.15:6802/31359 82 : cluster [WRN] 112 slow requests, 5 included below; oldest blocked for > 35.297988 secs
> 2016-06-19 09:37:54.071737 osd.6 10.145.2.15:6801/21836 221 : cluster [WRN] 253 slow requests, 5 included below; oldest blocked for > 35.325155 secs
> 2016-06-19 09:37:59.072570 osd.6 10.145.2.15:6801/21836 251 : cluster [WRN] 262 slow requests, 5 included below; oldest blocked for > 40.325986 secs
>
> And then when the monitors did report them down, the OSDs disputed that:
>
> 2016-06-19 09:38:35.821716 mon.0 10.145.2.13:6789/0 244970 : cluster [INF] osd.6 10.145.2.15:6801/21836 failed (2 reporters from different host after 20.000365 >= grace 20.000000)
> 2016-06-19 09:38:36.950556 mon.0 10.145.2.13:6789/0 244978 : cluster [INF] osd.22 10.145.2.15:6806/21826 failed (2 reporters from different host after 21.613336 >= grace 20.000000)
> 2016-06-19 09:38:36.951133 mon.0 10.145.2.13:6789/0 244980 : cluster [INF] osd.31 10.145.2.15:6812/21838 failed (2 reporters from different host after 21.613781 >= grace 20.836511)
> 2016-06-19 09:38:36.951636 mon.0 10.145.2.13:6789/0 244982 : cluster [INF] osd.36 10.145.2.15:6802/31359 failed (2 reporters from different host after 21.614259 >= grace 20.000000)
>
> 2016-06-19 09:38:37.156088 osd.36 10.145.2.15:6802/31359 346 : cluster [WRN] map e28730 wrongly marked me down
> 2016-06-19 09:38:36.002076 osd.6 10.145.2.15:6801/21836 473 : cluster [WRN] map e28729 wrongly marked me down
> 2016-06-19 09:38:37.046885 osd.22 10.145.2.15:6806/21826 374 : cluster [WRN] map e28730 wrongly marked me down
> 2016-06-19 09:38:37.050635 osd.31 10.145.2.15:6812/21838 351 : cluster [WRN] map e28730 wrongly marked me down
>
> But shortly after:
>
> 2016-06-19 09:43:39.940985 mon.0 10.145.2.13:6789/0 245305 : cluster [INF] osd.6 out (down for 303.951251)
> 2016-06-19 09:43:39.941061 mon.0 10.145.2.13:6789/0 245306 : cluster [INF] osd.22 out (down for 302.908528)
> 2016-06-19 09:43:39.941099 mon.0 10.145.2.13:6789/0 245307 : cluster [INF] osd.31 out (down for 302.908527)
> 2016-06-19 09:43:39.941152 mon.0 10.145.2.13:6789/0 245308 : cluster [INF] osd.36 out (down for 302.908527)
>
> 2016-06-19 10:09:10.648924 mon.0 10.145.2.13:6789/0 247076 : cluster [INF] osd.23 10.145.2.15:6814/21852 failed (2 reporters from different host after 20.000378 >= grace 20.000000)
> 2016-06-19 10:09:10.887220 osd.23 10.145.2.15:6814/21852 176 : cluster [WRN] map e28848 wrongly marked me down
> 2016-06-19 10:14:15.160513 mon.0 10.145.2.13:6789/0 247422 : cluster [INF] osd.23 out (down for 304.288018)
>
> By the time the issue was eventually escalated and I was able to do
> something about it, I manually marked the remaining OSDs on that host
> down (which seemed to unclog RBD):
>
> 2016-06-19 15:25:06.171395 mon.0 10.145.2.13:6789/0 267212 : cluster [INF] osd.7 10.145.2.15:6808/21837 failed (2 reporters from different host after 22.000367 >= grace 20.000000)
> 2016-06-19 15:25:06.171905 mon.0 10.145.2.13:6789/0 267214 : cluster [INF] osd.24 10.145.2.15:6800/21813 failed (2 reporters from different host after 22.000748 >= grace 20.710981)
> 2016-06-19 15:25:06.172426 mon.0 10.145.2.13:6789/0 267216 : cluster [INF] osd.37 10.145.2.15:6810/31936 failed (2 reporters from different host after 22.001167 >= grace 20.000000)
>
> The question I have is why these 3 OSDs, despite not being responsive
> for over 5 hours, stayed in the cluster. The CRUSH map for all pools
> uses hosts as fault boundaries, so I would have expected OSDs on other
> hosts to notice these as unresponsive and report them. Nothing was
> logged in the OSD logs in the hour prior to the failure, and the other
> OSDs seem to have noticed the rest of that host's OSDs timing out, but
> they appeared to still be actively attempting backfills to the 3 that
> stayed up.
>
> Any ideas on how I can improve detection of this condition?
>
> Cheers,
>  Adrian
>
>
> Confidentiality: This email and any attachments are confidential and may
> be subject to copyright, legal or some other professional privilege. They
> are intended solely for the attention and use of the named addressee(s).
> They may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach
> of the confidentiality clause, please notify the sender immediately by
> return email and delete or destroy all copies of the email. Any
> confidentiality, privilege or copyright is not waived or lost because
> this email has been sent to you by mistake.
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
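On the detection question: when this happens again you can at least force the stuck OSDs out of the data path by hand, and it is worth checking what the monitors are actually using for failure reporting before tuning anything. A rough sketch (the OSD ids are the ones from your log; the mon id being the short hostname is an assumption on my part, adjust to match your /var/run/ceph/ceph-mon.*.asok):

    # Manually mark the hung daemons down. A healthy OSD will simply
    # re-assert itself as up; a wedged one stays down, which is what
    # unclogged kRBD for you.
    for id in 7 24 37; do ceph osd down $id; done

    # Inspect the failure-detection settings on a monitor.
    ceph daemon mon.$(hostname -s) config get mon_osd_min_down_reporters
    ceph daemon mon.$(hostname -s) config get mon_osd_reporter_subtree_level
    ceph daemon mon.$(hostname -s) config get osd_heartbeat_grace
    ceph daemon mon.$(hostname -s) config get mon_osd_adjust_heartbeat_grace

The varying grace values in your log (20.836511, 20.710981) look like the effect of mon_osd_adjust_heartbeat_grace scaling the grace period, so tuning there mostly changes how quickly failures are reported, not whether peers report them at all - why the 3 OSDs were never reported is the part I can't explain either.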
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com