> On Jan 21, 2019, at 6:47 AM, Alfredo Deza <ad...@redhat.com> wrote:
>
> When creating an OSD, ceph-volume will capture the ID and the FSID and
> use these to create a systemd unit. When the system boots, it queries
> LVM for devices that match that ID/FSID information.
Thanks Alfredo, I see that now. The name comes from the symlink and is passed
into the script as %i. I should have seen that before, but at best I would have
done a hacky job of recreating them manually, so in hindsight I’m glad I did
not see that sooner.
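For anyone else chasing this, the enabled instances can be inspected directly. A
quick sketch (paths may vary by distro; the unit names are just the
ceph-volume@lvm-<OSD ID>-<OSD FSID> format described above):

    # show the ceph-volume template instances systemd knows about
    systemctl list-units 'ceph-volume@*' --all

    # the enabling symlinks live under the multi-user target
    ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume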
> Is it possible you've attempted to create an OSD and then failed, and
> tried again? That would explain why there would be a systemd unit with
> an FSID that doesn't match. By the output, it does look like
> you have an OSD 1, but with a different FSID (467... instead of
> e3b...). You could try to disable the failing systemd unit with:
>
> systemctl disable
> ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service
>
> (Follow up with OSD 3) and then run:
>
> ceph-volume lvm activate --all
That worked and recovered startup of all four OSDs on the second node. In an
abundance of caution, I only disabled one of the volumes with systemctl disable
and then ran ceph-volume lvm activate --all. That cleaned up all of them
though, so there was nothing left to do.
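For the archives, the whole recovery amounted to roughly this (substitute the
FSID from whichever unit is failing on your node):

    # drop the stale unit whose FSID no longer matches what LVM reports
    systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service

    # let ceph-volume rediscover every OSD from LVM metadata and
    # enable/start the correct units for all of them
    ceph-volume lvm activate --all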
https://bugzilla.redhat.com/show_bug.cgi?id=1567346#c21 helped resolve the
final issue in getting back to HEALTH_OK. After rebuilding the mon/mgr node, I had
not properly cleared and restored the firewall rules. It’s odd that `ceph osd tree`
was still reporting two of the OSDs as up and in when the ports for the mon/mgr/mds
were all inaccessible.
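In case anyone else trips over this, on a firewalld-based host the fix was along
these lines (this assumes the stock ceph/ceph-mon firewalld service definitions;
opening the raw port numbers works just as well):

    # ceph-mon covers the mon port (6789), ceph covers 6800-7300
    # for the osd/mgr/mds daemons
    firewall-cmd --permanent --add-service=ceph-mon
    firewall-cmd --permanent --add-service=ceph
    firewall-cmd --reload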
I don’t believe there were any failed creation attempts. Cardinal process rule
with filesystems: always maintain a known-good state that can be rolled back to,
and if an error comes up that can’t be fully explained, roll back and restart.
Sometimes a command gets missed by the best of fingers and fully caffeinated
minds... :) I do see that I didn’t do a `ceph osd purge` on the empty/downed
OSDs that were gracefully marked `out`. That explains the tree showing the
even-numbered OSDs on the rebuilt node. After purging the references to the empty
OSDs and re-adding the volumes, I am back to full health with all devices and
OSDs up/in.
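The purge itself is a one-liner per stale OSD (the ID below is a placeholder for
one of the empty, already-out OSDs):

    # removes the OSD from the CRUSH map and deletes its auth key and OSD entry
    ceph osd purge 2 --yes-i-really-mean-it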
THANK YOU!!! :D