On Sun, Jan 20, 2019 at 11:30 PM Brian Topping <brian.topp...@gmail.com> wrote:
>
> Hi all, looks like I might have pooched something. Between the two nodes I
> have, I moved all the PGs to one machine, reformatted the other machine,
> rebuilt that machine, and moved the PGs back. In both cases, I did this by
> marking the OSDs on the machine being moved "out", waiting for health to be
> restored, and then taking them down.
>
> This worked great up to the point where I had the mon/manager/rgw where they
> started, and all the OSDs/PGs on the other machine that had been rebuilt. The
> next step was to rebuild the master machine, copy /etc/ceph and /var/lib/ceph
> over with cpio, then re-add new OSDs on the master machine, as it were.
>
> This didn't work so well. The master has come up just fine, but it's not
> connecting to the OSDs. Of the four OSDs, only two came up; the other two
> (IDs 1 and 3) did not. For its part, the OSD machine is reporting lines like
> the following in its logs:
>
> > [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries left: 2
> > [2019-01-20 16:22:15,111][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> > [2019-01-20 16:22:15,271][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with fsid e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
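That trigger call embeds the id/fsid pair the unit is looking for (1 and
e3bfc69e-...). One quick way to compare it against what LVM actually has on
disk is to look at the tags ceph-volume sets on its logical volumes; assuming
the lvm2 tools are installed, something along these lines should show the
recorded id/fsid pairs (the ceph.* tag names are how ceph-volume labels its
LVs, so adjust if your version differs):

    lvs -o lv_name,lv_tags | grep ceph.osd_fsid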
When creating an OSD, ceph-volume captures the ID and the FSID and uses them to
create a systemd unit. When the system boots, that unit asks ceph-volume to
activate the matching ID/FSID pair, which it looks up in LVM. Is it possible you
attempted to create an OSD, had it fail, and then tried again? That would explain
why there is a systemd unit with an FSID that doesn't match. From the output
below, it does look like you have an OSD 1, but with a different FSID (467...
instead of e3b...).

You could try disabling the failing systemd unit with:

    systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service

(and the corresponding failing unit for osd.3), and then run:

    ceph-volume lvm activate --all

Hopefully that gets you back to activated OSDs.

> I see this for the volumes:
>
> > [root@gw02 ceph]# ceph-volume lvm list
> >
> > ====== osd.1 =======
> >
> >   [block]    /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> >
> >       type                  block
> >       osd id                1
> >       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name          ceph
> >       osd fsid              4672bb90-8cea-4580-85f2-1e692811a05a
> >       encrypted             0
> >       cephx lockbox secret
> >       block uuid            3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
> >       block device          /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> >       vdo                   0
> >       crush device class    None
> >       devices               /dev/sda3
> >
> > ====== osd.3 =======
> >
> >   [block]    /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> >
> >       type                  block
> >       osd id                3
> >       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name          ceph
> >       osd fsid              084cf33d-8a38-4c82-884a-7c88e3161479
> >       encrypted             0
> >       cephx lockbox secret
> >       block uuid            PSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
> >       block device          /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> >       vdo                   0
> >       crush device class    None
> >       devices               /dev/sdb3
> >
> > ====== osd.5 =======
> >
> >   [block]    /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> >
> >       type                  block
> >       osd id                5
> >       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name          ceph
> >       osd fsid              e854930d-1617-4fe7-b3cd-98ef284643fd
> >       encrypted             0
> >       cephx lockbox secret
> >       block uuid            F5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
> >       block device          /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> >       vdo                   0
> >       crush device class    None
> >       devices               /dev/sdc3
> >
> > ====== osd.7 =======
> >
> >   [block]    /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> >
> >       type                  block
> >       osd id                7
> >       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name          ceph
> >       osd fsid              5c0d0404-390e-4801-94a9-da52c104206f
> >       encrypted             0
> >       cephx lockbox secret
> >       block uuid            wgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
> >       block device          /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> >       vdo                   0
> >       crush device class    None
> >       devices               /dev/sdd3
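The "osd fsid" values in that listing (4672bb90-... for osd.1 and 084cf33d-...
for osd.3) are what the enabled ceph-volume units need to reference. Listing
what is actually enabled should make any stale leftovers stand out; on a stock
systemd layout the enablement symlinks end up under multi-user.target.wants
(the exact path may differ on your distro):

    ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume

Any ceph-volume@lvm-<id>-<fsid> unit whose id/fsid pair does not show up in
ceph-volume lvm list is a leftover from a failed create and can be disabled as
above.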
> What I am wondering is if device mapper has lost something with a kernel or
> library change:
>
> > [root@gw02 ceph]# ls -l /dev/dm*
> > brw-rw----. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> > brw-rw----. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> > brw-rw----. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> > brw-rw----. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> > brw-rw----. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
> >
> > [root@gw02 ~]# dmsetup ls
> > ceph--1f3d4406--af86--4813--8d06--a001c57408fa-osd--block--5c0d0404--390e--4801--94a9--da52c104206f   (253:1)
> > ceph--f5f453df--1d41--4883--b0f8--d662c6ba8bea-osd--block--084cf33d--8a38--4c82--884a--7c88e3161479   (253:4)
> > ceph--033e2bbe--5005--45d9--9ecd--4b541fe010bd-osd--block--e854930d--1617--4fe7--b3cd--98ef284643fd   (253:2)
> > hndc1.centos02-root     (253:0)
> > ceph--c7640f3e--0bf5--4d75--8dd4--00b6434c84d9-osd--block--4672bb90--8cea--4580--85f2--1e692811a05a   (253:3)
>
> How can I debug this? I suspect this is just some kind of a UID swap that
> happened somewhere, but I don't know what the chain of truth is through the
> database files to connect the two together and make sure I have the correct
> OSD blocks where the mon expects to find them.
>
> Thanks! Brian

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com