On Sun, Jan 20, 2019 at 11:30 PM Brian Topping <brian.topp...@gmail.com> wrote:
>
> Hi all, looks like I might have pooched something. Between the two nodes I 
> have, I moved all the PGs to one machine, reformatted the other machine, 
> rebuilt that machine, and moved the PGs back. In both cases, I did this by 
> marking the OSDs on the machine being moved “out”, waiting for health 
> to be restored, and then taking them down.
>
> This worked great up to the point where I had the mon/manager/rgw on the 
> machine where they started, and all the OSDs/PGs on the other machine that 
> had been rebuilt. The next step was to rebuild the master machine, copy 
> /etc/ceph and /var/lib/ceph with cpio, then re-add new OSDs on the master 
> machine.
>
> This didn’t work so well. The master has come up just fine, but it’s not 
> connecting to the OSDs. Of the four OSDs, only two came up, and the other two 
> did not (IDs 1 and 3). For its part, the OSD machine is reporting lines like 
> the following in its logs:
>
> > [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries 
> > left: 2
> > [2019-01-20 16:22:15,111][ceph_volume.process][INFO  ] Running command: 
> > /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> > [2019-01-20 16:22:15,271][ceph_volume.process][INFO  ] stderr -->  
> > RuntimeError: could not find osd.1 with fsid 
> > e3bfc69e-a145-4e19-aac2-5f888e1ed2ce

When creating an OSD, ceph-volume captures the OSD ID and the FSID and
uses them to create a systemd unit. When the system boots, that unit queries
LVM for devices that match that ID/FSID pair.
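
For example, you can compare the units that ceph-volume enabled when the OSDs
were created against what LVM reports right now. A rough sketch (the exact
unit directory can vary by distro and systemd setup):

    # units enabled by ceph-volume encode <OSD id>-<OSD fsid> in their name
    ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume
    # what LVM itself currently knows about
    ceph-volume lvm list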

Is it possible you attempted to create an OSD, it failed, and you tried
again? That would explain a systemd unit with an FSID that doesn't match.
From the output, it does look like you have an OSD 1, but with a different
FSID (467... instead of e3b...). You could try disabling the failing
systemd unit with:

    systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service

(do the same for the OSD 3 unit) and then run:

    ceph-volume lvm activate --all

Hopefully that gets your OSDs activated again.
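
If --all still doesn't bring up 1 and 3, you can also activate a single OSD
explicitly by passing the ID and FSID pair that ceph-volume lvm list reports.
A sketch using the "osd fsid" values from your listing further down (please
double-check them against your own output first):

    ceph-volume lvm activate 1 4672bb90-8cea-4580-85f2-1e692811a05a
    ceph-volume lvm activate 3 084cf33d-8a38-4c82-884a-7c88e3161479

Activating should also re-enable systemd units that carry the matching FSIDs,
so the next boot comes up clean.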
>
>
> I see this for the volumes:
>
> > [root@gw02 ceph]# ceph-volume lvm list
> >
> > ====== osd.1 =======
> >
> >   [block]    
> > /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> >
> >       type                      block
> >       osd id                    1
> >       cluster fsid              1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name              ceph
> >       osd fsid                  4672bb90-8cea-4580-85f2-1e692811a05a
> >       encrypted                 0
> >       cephx lockbox secret
> >       block uuid                3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
> >       block device              
> > /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> >       vdo                       0
> >       crush device class        None
> >       devices                   /dev/sda3
> >
> > ====== osd.3 =======
> >
> >   [block]    
> > /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> >
> >       type                      block
> >       osd id                    3
> >       cluster fsid              1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name              ceph
> >       osd fsid                  084cf33d-8a38-4c82-884a-7c88e3161479
> >       encrypted                 0
> >       cephx lockbox secret
> >       block uuid                PSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
> >       block device              
> > /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> >       vdo                       0
> >       crush device class        None
> >       devices                   /dev/sdb3
> >
> > ====== osd.5 =======
> >
> >   [block]    
> > /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> >
> >       type                      block
> >       osd id                    5
> >       cluster fsid              1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name              ceph
> >       osd fsid                  e854930d-1617-4fe7-b3cd-98ef284643fd
> >       encrypted                 0
> >       cephx lockbox secret
> >       block uuid                F5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
> >       block device              
> > /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> >       vdo                       0
> >       crush device class        None
> >       devices                   /dev/sdc3
> >
> > ====== osd.7 =======
> >
> >   [block]    
> > /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> >
> >       type                      block
> >       osd id                    7
> >       cluster fsid              1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name              ceph
> >       osd fsid                  5c0d0404-390e-4801-94a9-da52c104206f
> >       encrypted                 0
> >       cephx lockbox secret
> >       block uuid                wgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
> >       block device              
> > /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> >       vdo                       0
> >       crush device class        None
> >       devices                   /dev/sdd3
>
> What I am wondering is if device mapper has lost something with a kernel or 
> library change:
>
> > [root@gw02 ceph]# ls -l /dev/dm*
> > brw-rw----. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> > brw-rw----. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> > brw-rw----. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> > brw-rw----. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> > brw-rw----. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
> > [root@gw02 ~]# dmsetup ls
> > ceph--1f3d4406--af86--4813--8d06--a001c57408fa-osd--block--5c0d0404--390e--4801--94a9--da52c104206f
> >    (253:1)
> > ceph--f5f453df--1d41--4883--b0f8--d662c6ba8bea-osd--block--084cf33d--8a38--4c82--884a--7c88e3161479
> >    (253:4)
> > ceph--033e2bbe--5005--45d9--9ecd--4b541fe010bd-osd--block--e854930d--1617--4fe7--b3cd--98ef284643fd
> >    (253:2)
> > hndc1.centos02-root   (253:0)
> > ceph--c7640f3e--0bf5--4d75--8dd4--00b6434c84d9-osd--block--4672bb90--8cea--4580--85f2--1e692811a05a
> >    (253:3)
>
> How can I debug this? I suspect this is just some kind of a UID swap that 
> happened somewhere, but I don’t know what the chain of truth is through 
> the database files to connect the two together and make sure I have the 
> correct OSD blocks where the mon expects to find them.
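
The chain of truth is LVM itself: ceph-volume stamps each OSD logical volume
with tags (ceph.osd_id, ceph.osd_fsid, ceph.cluster_fsid, and so on), and the
ceph-volume lvm list output you pasted above is generated from those tags. A
rough sketch for tracing things from the device-mapper side back to the OSD
(tag names assume a reasonably current ceph-volume):

    # every LV plus the tags ceph-volume wrote on it
    lvs -o lv_name,vg_name,lv_tags
    # the dm name/minor mapping, to match against your /dev/dm-* listing
    dmsetup info -c

If the osd id/fsid in the tags line up with the lvm list output, nothing was
lost on the device-mapper side and the only mismatch is in the stale systemd
units.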
>
> Thanks! Brian
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com