Thanks for testing. That should rule out udev as the cause of the race.

A couple of observations from the log:

* For each OSD there is a loop that calls 'ceph-volume lvm trigger' up to
30 times until the OSD is activated. For example, for OSD 4:
[2019-05-31 01:27:29,235][ceph_volume.process][INFO  ] Running command: ceph-volume lvm trigger 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,435][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,530][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:35,531][systemd][WARNING] failed activating OSD, retries left: 30
[2019-05-31 01:27:44,122][ceph_volume.process][INFO  ] stderr -->  RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:44,174][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:44,175][systemd][WARNING] failed activating OSD, retries left: 29
...
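
To make the retry behaviour in the log concrete, here is a minimal sketch of
that kind of loop. It is not the ceph-volume source: the function name
retry_activation, its parameters, and the 5 second default interval are
assumptions for illustration. Only the count of 30 tries comes from the log.

```python
import time


def retry_activation(activate, tries=30, interval=5, sleep=time.sleep):
    """Call activate() until it succeeds or the tries run out.

    Hypothetical helper, not part of ceph-volume. Returns the number of
    failed attempts before the first success, and re-raises the last
    RuntimeError if every attempt fails.
    """
    if tries <= 0:
        raise ValueError("tries must be positive")
    failures = 0
    last_error = None
    while tries > 0:
        try:
            activate()
            return failures
        except RuntimeError as error:
            # Mirrors the "failed activating OSD, retries left: N" pattern.
            last_error = error
            failures += 1
            tries -= 1
            sleep(interval)
    raise last_error
```

With a loop like this, an OSD whose block device shows up late is simply
retried, but nothing in it waits for the DB or WAL devices.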

I wonder if we can have similar 'ceph-volume lvm trigger' calls for WAL
and DB devices per OSD. Does that even make sense? Or perhaps another
call with a similar goal. We should be able to determine if an OSD has a
DB or WAL device from the lvm tags.
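
As a rough sketch of the tag check, assuming the tags arrive as the usual
comma-separated key=value string and that the relevant keys are
ceph.db_device and ceph.wal_device (please verify against the tags on an
affected host). The helper names and the device path in the usage line are
made up for illustration.

```python
def parse_lv_tags(lv_tags):
    """Split an LVM tag string such as 'a=1,b=2' into a dict."""
    tags = {}
    for item in lv_tags.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip entries that are not key=value pairs
            tags[key.strip()] = value.strip()
    return tags


def extra_devices(lv_tags):
    """Return the DB/WAL devices named in the tags, keyed by 'db'/'wal'.

    Assumes the tag keys are ceph.db_device and ceph.wal_device.
    """
    tags = parse_lv_tags(lv_tags)
    found = {}
    for role in ("db", "wal"):
        device = tags.get("ceph.%s_device" % role)
        if device:
            found[role] = device
    return found
```

For example, extra_devices("ceph.osd_id=4,ceph.db_device=/dev/example-vg/db-lv")
would report a DB device and no WAL device, so an activation step could wait
for that path to exist before proceeding.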

* The first 3 OSDs to be activated are 18, 4, and 11, and they are the 3
that are missing the block.db/block.wal symlinks. That is further
confirmation that this is a race:
[2019-05-31 01:28:03,370][systemd][INFO  ] successfully trggered activation for: 18-eb5270dc-1110-420f-947e-aab7fae299c9
[2019-05-31 01:28:12,354][systemd][INFO  ] successfully trggered activation for: 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:28:12,530][systemd][INFO  ] successfully trggered activation for: 11-33de740d-bd8c-4b47-a601-3e6e634e489a
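
For anyone who wants to pull the activation order out of a larger log, a
small sketch along these lines should do. The function name
activation_order is hypothetical, and the pattern copies the message
spelling ("trggered") exactly as it appears in the log above. It allows
whitespace or a line break before "for:" so that it also works on wrapped
copies of the log.

```python
import re

# Matches "<osd id>-<osd fsid>" after the success message; the fsid is a
# 36 character UUID. "trggered" is the spelling used in the log itself.
ACTIVATED = re.compile(
    r"successfully trggered activation\s+for:\s+(\d+)-([0-9a-f-]{36})"
)


def activation_order(log_text):
    """Return (osd_id, fsid) tuples in the order they appear in the log."""
    return [(int(osd_id), fsid) for osd_id, fsid in ACTIVATED.findall(log_text)]
```

Comparing the first few entries of that list with the OSDs that lack
block.db/block.wal symlinks is the same check done by hand above.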

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1828617

Title:
  Hosts randomly 'losing' disks, breaking ceph-osd service enumeration