> On Jan 21, 2019, at 6:47 AM, Alfredo Deza <ad...@redhat.com> wrote:
>
> When creating an OSD, ceph-volume will capture the ID and the FSID and
> use these to create a systemd unit. When the system boots, it queries
> LVM for devices that match that ID/FSID information.
Thanks Alfredo, I see that now. The name comes from the symlink and is passed
into the script as %i. I should have seen that before, but at best I would have
done a hacky job of recreating them manually, so in hindsight I’m glad I did
not see that sooner.
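For anyone else chasing this, the enabled instances can be inspected directly. A
quick sketch (paths may vary by distro; the unit names are just the
ceph-volume@lvm-<OSD ID>-<OSD FSID> format described above):

    # show the ceph-volume template instances systemd knows about
    systemctl list-units 'ceph-volume@*' --all

    # the enabling symlinks live under the multi-user target
    ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume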
> Is it possible you've attempted to create an OSD and then failed, and
> tried again? That would explain why there would be a systemd unit with
> an FSID that doesn't match. By the output, it does look like
> you have an OSD 1, but with a different FSID (467... instead of
> e3b...). You could try to disable the failing systemd unit with:
>
> systemctl disable
> ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service
>
> (Follow up with OSD 3) and then run:
>
> ceph-volume lvm activate --all
That worked and recovered startup of all four OSDs on the second node. In an
abundance of caution, I only disabled one of the volumes with systemctl disable
and then ran ceph-volume lvm activate --all. That cleaned up all of them
though, so there was nothing left to do.
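For the archives, the whole recovery amounted to roughly this (substitute the
FSID from whichever unit is failing on your node):

    # drop the stale unit whose FSID no longer matches what LVM reports
    systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service

    # let ceph-volume rediscover every OSD from LVM metadata and
    # enable/start the correct units for all of them
    ceph-volume lvm activate --all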
https://bugzilla.redhat.com/show_bug.cgi?id=1567346#c21 helped resolve the
final issue in getting back to HEALTH_OK. After rebuilding the mon/mgr node, I had
not properly cleared and restored the firewall rules. It’s odd that `ceph osd tree`
was still reporting two of the OSDs as up and in when the ports for the mon/mgr/mds
were all inaccessible.
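In case anyone else trips over this, on a firewalld-based host the fix was along
these lines (this assumes the stock ceph/ceph-mon firewalld service definitions;
opening the raw port numbers works just as well):

    # ceph-mon covers the mon port (6789), ceph covers 6800-7300
    # for the osd/mgr/mds daemons
    firewall-cmd --permanent --add-service=ceph-mon
    firewall-cmd --permanent --add-service=ceph
    firewall-cmd --reload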
I don’t believe there were any failed creation attempts. Cardinal process rule
with filesystems: always maintain a known-good state that can be rolled back to,
and if an error comes up that can’t be fully explained, roll back and restart.
Sometimes a command gets missed by the best of fingers and fully caffeinated
minds... :) I do see that I didn’t do a `ceph osd purge` on the empty/downed
OSDs that were gracefully marked `out`. That explains the tree showing the
even-numbered OSDs on the rebuilt node. After purging the references to the empty
OSDs and re-adding the volumes, I am back to full health with all devices and
OSDs up/in.
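The purge itself is a one-liner per stale OSD (the ID below is a placeholder for
one of the empty, already-out OSDs):

    # removes the OSD from the CRUSH map and deletes its auth key and OSD entry
    ceph osd purge 2 --yes-i-really-mean-it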
THANK YOU!!! :D