We're running journals on NVMe as well - SLES 
 
Before rebooting, try deleting the links here:
 /etc/systemd/system/ceph-osd.target.wants/
 
If we delete them first, it boots OK.
If we don't, the disks sometimes don't come up and we have to run
ceph-disk activate-all
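
A minimal sketch of that pre-reboot step, in case it is useful to script. The directory is the one quoted above; the function name `remove_osd_wants_links` and the optional directory argument are made up for illustration and are not part of any Ceph tooling. It only removes symlinks and prints what it removed.

```shell
#!/bin/sh
# Sketch only: remove the symlinks under ceph-osd.target.wants before a
# planned reboot. remove_osd_wants_links is a hypothetical helper name.
remove_osd_wants_links() {
    # Default to the path from this thread; an argument overrides it.
    wants_dir="${1:-/etc/systemd/system/ceph-osd.target.wants}"
    if [ ! -d "$wants_dir" ]; then
        echo "no such directory: $wants_dir" >&2
        return 1
    fi
    # Touch symlinks only, one level deep; print each one as it is deleted.
    find "$wants_dir" -maxdepth 1 -type l -print -delete
}
```

Run as root it would be `remove_osd_wants_links` followed by the reboot. Trying it against a scratch directory first (`remove_osd_wants_links /tmp/some-test-dir`) is probably wise.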

HTH
 
Thanks Joe

>>> David Turner <drakonst...@gmail.com> 9/15/2017 9:54 AM >>>
I have this issue with my NVMe OSDs, but not my HDD OSDs. I have 15
HDDs and 2 NVMes in each host. We put most of the journals on one of
the NVMes and a few on the second, but added a small OSD partition to
the second NVMe for RGW metadata pools.

When restarting a server manually for testing, the NVMe OSD comes back
up normally. We're tracking a problem with the OSD nodes freezing and
having to force reboot them. After this, the NVMe OSD doesn't come back
on its own until I run `ceph-disk activate-all`. This seems to track
with your theory that a non-clean FS is a part of the equation.

Are there any ideas yet on how to resolve this? So far, being able to
run `ceph-disk activate-all` is good enough, but it's a bit of a nuisance.
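
Until the root cause is found, the manual step could be wrapped in a small check. This is a rough sketch, not a fix: it counts mount points under `/var/lib/ceph/osd` (assumed here to be the OSD data directory, which is the Ceph default) and only runs `ceph-disk activate-all` when fewer than the expected number are mounted. The function names, the `ACTIVATE_CMD` override, and the expected count you pass in are all hypothetical.

```shell
#!/bin/sh
# Sketch only: re-run activation when fewer OSD data dirs are mounted than
# expected. Function names and ACTIVATE_CMD are hypothetical.

# Count mount points below the OSD root listed in a mounts file.
count_mounted_osds() {
    mounts_file="${1:-/proc/mounts}"
    osd_root="${2:-/var/lib/ceph/osd}"   # assumed default OSD data location
    # Field 2 of each line in /proc/mounts is the mount point.
    awk -v root="$osd_root/" \
        'index($2, root) == 1 { n++ } END { print n + 0 }' "$mounts_file"
}

# $1 = number of OSDs this host should have (supplied by the operator).
activate_if_missing() {
    expected="$1"
    mounted=$(count_mounted_osds "${2:-/proc/mounts}" "${3:-/var/lib/ceph/osd}")
    if [ "$mounted" -lt "$expected" ]; then
        echo "only $mounted of $expected OSD data dirs mounted, activating"
        # Left unquoted on purpose so the command and its argument split.
        ${ACTIVATE_CMD:-ceph-disk activate-all}
    else
        echo "all $expected OSD data dirs mounted"
    fi
}
```

On a host with 16 OSDs this would be called as `activate_if_missing 16` some time after boot. It does nothing about the underlying race, it just saves logging in to run the command by hand.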

On Fri, Sep 15, 2017 at 11:48 AM Matthew Vernon <m...@sanger.ac.uk>
wrote:


Hi,

On 14/09/17 16:26, Götz Reinicke wrote:

> maybe someone has a hint: I do have a Ceph cluster (6 nodes, 144
> OSDs), CentOS 7.3, ceph 10.2.7.
>
> I did a kernel update to the recent CentOS 7.3 one on a node and did a
> reboot.
>
> After that, 10 OSDs did not come up like the others. The disks did not get
> mounted and the OSD processes did nothing … even after a couple of
> minutes no more disks/OSDs showed up.
>
> So I did a ceph-disk activate-all.
>
> And all missing OSDs got back online.
>
> Questions: Any hints on debugging why the disks did not come online after
> the reboot?

We've been seeing this on our Ubuntu / Jewel cluster, after we upgraded
from ceph 10.2.3 / kernel 4.4.0-62 to ceph 10.2.7 / kernel 4.4.0-93.

I'm still digging, but AFAICT it's a race condition in startup - in our
case, we're only seeing it if some of the filesystems aren't clean. This
may be related to the thread "Very slow start of osds after reboot" from
August, but I don't think any conclusion was reached there.

Regards,

Matthew


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
