Hi all, looks like I might have pooched something. Between the two nodes I 
have, I moved all the PGs to one machine, reformatted the other machine, 
rebuilt it, and moved the PGs back. In both cases I did this by marking the 
OSDs on the machine being vacated “out”, waiting for health to be restored, 
and then taking those OSDs down. 
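
For reference, the per-OSD procedure was roughly this (from memory, so treat 
the exact commands as a sketch rather than a transcript):

    ceph osd out <id>              # let the data drain off the OSD
    ceph -s                        # wait here until health is restored
    systemctl stop ceph-osd@<id>   # then take the daemon down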

This worked great up to the point where I had the mon/manager/rgw on the 
machine where they started, and all the OSDs/PGs on the other machine that had 
just been rebuilt. The next step was to rebuild the master machine, copy 
/etc/ceph and /var/lib/ceph back onto it with cpio, and then re-add new OSDs 
on the master machine.
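
The copy step was along these lines (paraphrasing; /mnt/newroot here is just a 
stand-in for wherever the rebuilt root was mounted at the time):

    cd / && find etc/ceph var/lib/ceph -depth -print | cpio -pdmv /mnt/newroot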

This didn’t work so well. The master has come up just fine, but it’s not 
connecting to all of the OSDs: of the four OSDs, two came up and the other two 
(IDs 1 and 3) did not. For its part, the OSD machine is reporting lines like 
the following in its logs:

> [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries 
> left: 2
> [2019-01-20 16:22:15,111][ceph_volume.process][INFO  ] Running command: 
> /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> [2019-01-20 16:22:15,271][ceph_volume.process][INFO  ] stderr -->  
> RuntimeError: could not find osd.1 with fsid 
> e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
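
If I understand the activation path correctly, the fsid in that trigger call 
comes from the name of the enabled systemd unit (something like 
ceph-volume@lvm-1-<fsid>.service), so listing the enabled units should show 
where that value is coming from:

    ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume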


I see this for the volumes:

> [root@gw02 ceph]# ceph-volume lvm list 
> 
> ====== osd.1 =======
> 
>   [block]    
> /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> 
>       type                      block
>       osd id                    1
>       cluster fsid              1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name              ceph
>       osd fsid                  4672bb90-8cea-4580-85f2-1e692811a05a
>       encrypted                 0
>       cephx lockbox secret      
>       block uuid                3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
>       block device              
> /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
>       vdo                       0
>       crush device class        None
>       devices                   /dev/sda3
> 
> ====== osd.3 =======
> 
>   [block]    
> /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> 
>       type                      block
>       osd id                    3
>       cluster fsid              1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name              ceph
>       osd fsid                  084cf33d-8a38-4c82-884a-7c88e3161479
>       encrypted                 0
>       cephx lockbox secret      
>       block uuid                PSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
>       block device              
> /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
>       vdo                       0
>       crush device class        None
>       devices                   /dev/sdb3
> 
> ====== osd.5 =======
> 
>   [block]    
> /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> 
>       type                      block
>       osd id                    5
>       cluster fsid              1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name              ceph
>       osd fsid                  e854930d-1617-4fe7-b3cd-98ef284643fd
>       encrypted                 0
>       cephx lockbox secret      
>       block uuid                F5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
>       block device              
> /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
>       vdo                       0
>       crush device class        None
>       devices                   /dev/sdc3
> 
> ====== osd.7 =======
> 
>   [block]    
> /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> 
>       type                      block
>       osd id                    7
>       cluster fsid              1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name              ceph
>       osd fsid                  5c0d0404-390e-4801-94a9-da52c104206f
>       encrypted                 0
>       cephx lockbox secret      
>       block uuid                wgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
>       block device              
> /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
>       vdo                       0
>       crush device class        None
>       devices                   /dev/sdd3
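
One thing that jumps out when comparing the two: the fsid in the failing 
trigger call for osd.1 (e3bfc69e-...) does not match the osd fsid that 
ceph-volume reports for osd.1 above (4672bb90-...). Assuming the LVM tags are 
what ceph-volume reads, they can also be checked directly (tag names quoted 
from memory, so take this as a sketch):

    lvs -o lv_name,lv_tags --noheadings   # look for ceph.osd_id= and ceph.osd_fsid=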

What I am wondering is whether device mapper has lost track of something with 
a kernel or library change:

> [root@gw02 ceph]# ls -l /dev/dm*
> brw-rw----. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> brw-rw----. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> brw-rw----. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> brw-rw----. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> brw-rw----. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
> [root@gw02 ~]# dmsetup ls
> ceph--1f3d4406--af86--4813--8d06--a001c57408fa-osd--block--5c0d0404--390e--4801--94a9--da52c104206f
>    (253:1)
> ceph--f5f453df--1d41--4883--b0f8--d662c6ba8bea-osd--block--084cf33d--8a38--4c82--884a--7c88e3161479
>    (253:4)
> ceph--033e2bbe--5005--45d9--9ecd--4b541fe010bd-osd--block--e854930d--1617--4fe7--b3cd--98ef284643fd
>    (253:2)
> hndc1.centos02-root   (253:0)
> ceph--c7640f3e--0bf5--4d75--8dd4--00b6434c84d9-osd--block--4672bb90--8cea--4580--85f2--1e692811a05a
>    (253:3)
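
The mappings themselves look sane to me: dm-0 (root:disk) is the root LV, and 
the four ceph block LVs are all present and owned ceph:ceph. Assuming the 
device-mapper layer really is fine, the next thing I can think to check is 
what the BlueStore label on each LV says about itself, e.g. for osd.1’s block 
device from the listing above:

    ceph-bluestore-tool show-label --dev \
        /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a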

How can I debug this? I suspect some kind of UUID mix-up happened somewhere, 
but I don’t know the chain of truth through the various files and maps that 
ties the two ends together, so I can’t confirm that the correct OSD blocks are 
where the mon expects to find them.
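
My current (possibly wrong) picture of that chain, and how I would compare 
each link, is roughly:

    # what the cluster map thinks each OSD's uuid is (on the mon host)
    ceph osd dump | grep '^osd\.'

    # what the LVM tags say is actually on disk (on the OSD host)
    ceph-volume lvm list

    # if the enabled systemd units carry a stale fsid, I am guessing that
    # re-activating would recreate them from the current tags (untested):
    ceph-volume lvm activate --all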

Thanks! Brian
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com