We're running journals on NVMe as well, on SLES.

Before rebooting, try deleting the links in /etc/systemd/system/ceph-osd.target.wants/. If we delete them first, the node boots OK. If we don't, the disks sometimes don't come up and we have to run `ceph-disk activate-all`.

HTH

Thanks
Joe

>>> David Turner <drakonst...@gmail.com> 9/15/2017 9:54 AM >>>
I have this issue with my NVMe OSDs, but not my HDD OSDs. I have 15 HDDs and 2 NVMes in each host. We put most of the journals on one of the NVMes and a few on the second, but added a small OSD partition to the second NVMe for RGW metadata pools. When restarting a server manually for testing, the NVMe OSD comes back up normally. We're tracking a problem with the OSD nodes freezing and having to force reboot them. After this, the NVMe OSD doesn't come back on its own until I run `ceph-disk activate-all`. This seems to track with your theory that a non-clean FS is a part of the equation. Are there any ideas as to how to resolve this yet? So far being able to run `ceph-disk activate-all` is good enough, but a bit of a nuisance.

On Fri, Sep 15, 2017 at 11:48 AM Matthew Vernon <m...@sanger.ac.uk> wrote:

Hi,

On 14/09/17 16:26, Götz Reinicke wrote:
> maybe someone has a hint: I do have a Ceph cluster (6 nodes, 144
> OSDs), CentOS 7.3, ceph 10.2.7.
>
> I did a kernel update to the recent CentOS 7.3 one on a node and did a
> reboot.
>
> After that, 10 OSDs did not come up as the others did. The disks did not get
> mounted and the OSD processes did nothing … even after a couple of
> minutes no more disks/OSDs showed up.
>
> So I did a ceph-disk activate-all.
>
> And all missing OSDs got back online.
>
> Questions: Any hints on debugging why the disks did not get online after
> the reboot?

We've been seeing this on our Ubuntu / Jewel cluster, after we upgraded from ceph 10.2.3 / kernel 4.4.0-62 to ceph 10.2.7 / kernel 4.4.0-93. I'm still digging, but AFAICT it's a race condition in startup - in our case, we're only seeing it if some of the filesystems aren't clean. This may be related to the thread "Very slow start of osds after reboot" from August, but I don't think any conclusion was reached there.

Regards,

Matthew

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
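
For anyone who wants to try the pre-reboot workaround from Joe's reply at the top of this thread, here is a minimal shell sketch. It is an illustration only, not something posted or tested by anyone in the thread: the function name `clear_osd_wants_links` and the `WANTS_DIR` variable are made up for this example, and the only things taken from the thread are the directory path and the `ceph-disk activate-all` fallback. It defaults to a dry run, so nothing is deleted unless you ask for it.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): list, and optionally
# remove, the symlinks in a systemd ".wants" directory.
# Defaults to a dry run so nothing is deleted by accident.

# Directory named in the thread; can be overridden from the environment.
WANTS_DIR="${WANTS_DIR:-/etc/systemd/system/ceph-osd.target.wants}"

clear_osd_wants_links() {
    dir="$1"
    mode="${2:-dry-run}"   # "dry-run" (default) or "delete"

    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }

    for link in "$dir"/*; do
        # Only touch symlinks; skip regular files and an unexpanded glob.
        [ -L "$link" ] || continue
        if [ "$mode" = "delete" ]; then
            rm -- "$link" && echo "removed $link"
        else
            echo "would remove $link"
        fi
    done
}

# Example use (as root, before a planned reboot):
#   clear_osd_wants_links "$WANTS_DIR"          # show what would be removed
#   clear_osd_wants_links "$WANTS_DIR" delete   # actually remove the links
# If some OSDs are still missing after the reboot, the fallback reported in
# this thread is:
#   ceph-disk activate-all
```

Running the dry run first makes it easy to check that only the OSD unit links would be removed. Whether removing them helps on distributions other than SLES is not established anywhere in this thread.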