Hi All,

We are seeing the same problem here at Rutherford Appleton Laboratory:
During our patching against Stack Clash on our large physics data cluster, only about 8 of the 36 OSD disks per storage node remount when the node is rebooted. We coaxed the rest to mount manually during the reboot campaign (see method below), but obviously we want a longer-term solution. I believe the problem occurs because many of the OSD daemons are started before their OSD disk has been mounted.

From: [ceph-users] erratic startup of OSDs at reboot time, 2017-07-12, Graham Allan

We tried running:

    udevadm trigger --subsystem-match=block --action=add

with occasional success, but this wasn't reliable.

From: [ceph-users] CentOS7 Mounting Problem, 2017-04-10, Jake Young

It is interesting that running partprobe causes the OSD disk to mount and the OSD to start automatically. However, I don't know why this would fix the problem for subsequent reboots.

Note: Interestingly, I have one example of this model of storage node (36 OSDs per host) in our development cluster (78 OSDs); over 5 reboots, all OSD disks mounted and all OSD processes started, so I am unable to reproduce the problem at small scale.

Best wishes,
Bruno

--------
Cluster:
    5 MONs
    1404 OSDs
    39 storage nodes, 36 OSD disks per node, connected via a PCI HBA

Software:
    OS: SL7x
    Ceph Release: kraken
    Ceph Version: 11.2.0-0
    Ceph Deploy Release: kraken
    Ceph Deploy Version: 1.5.37-0

OSDs created as follows:

    ceph-deploy disk zap $sn_fqdn:sdb
    ceph-deploy --overwrite-conf config pull $sn_fqdn
    ceph-deploy osd prepare $sn_fqdn:sdb

Coaxing method:

    for srv in $(systemctl list-units -t service --full --no-pager -n0 | grep ceph-disk | awk '{print $2}'); do
        echo "Starting $srv" | ts
        systemctl start $srv
        sleep 1
    done
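For what it's worth, the same coaxing could be run automatically after boot rather than by hand. The sketch below is only an illustration, assuming systemd and the ceph-disk@ activation units discussed in this thread; the script path and unit name (/usr/local/sbin/ceph-coax-osds.sh, ceph-coax-osds.service) are made up for the example and are not shipped with Ceph:

    #!/bin/bash
    # /usr/local/sbin/ceph-coax-osds.sh (hypothetical path)
    # Re-apply the manual workarounds from this thread after boot:
    # re-announce block devices to udev, then start any ceph-disk
    # activation units that have not run yet.
    set -u

    # Same idea as the udevadm workaround quoted above.
    udevadm trigger --subsystem-match=block --action=add
    udevadm settle

    # Start every ceph-disk@ unit systemd knows about that is not active yet.
    for srv in $(systemctl list-units -t service --all --full --no-pager --no-legend \
                 | grep -o 'ceph-disk@[^ ]*\.service'); do
        if ! systemctl is-active --quiet "$srv"; then
            echo "Starting $srv"
            systemctl start "$srv"
            sleep 1
        fi
    done

and a one-shot unit to call it late in boot:

    # /etc/systemd/system/ceph-coax-osds.service (hypothetical unit)
    [Unit]
    Description=Coax unmounted Ceph OSD disks after boot
    After=local-fs.target ceph-osd.target

    [Service]
    Type=oneshot
    ExecStart=/usr/local/sbin/ceph-coax-osds.sh

    [Install]
    WantedBy=multi-user.target

This only papers over the race between udev/ceph-disk activation and the OSD units, of course; the underlying ordering problem still needs a proper fix.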
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Willem Jan Withagen
Sent: 20 July 2017 19:06
To: Roger Brown; ceph-users
Subject: Re: [ceph-users] ceph-disk activate-block: not a block device

Hi Roger,

Device detection has recently changed (because FreeBSD does not have block devices), so it could very well be that this is an actual problem where something is still wrong.

Please keep an eye out, and let me know if it comes back.

--WjW

On 20-07-2017 at 19:29, Roger Brown wrote:

So I disabled ceph-disk and will chalk it up as a red herring to ignore.

On Thu, Jul 20, 2017 at 11:02 AM Roger Brown <rogerpbr...@gmail.com> wrote:

Also, I'm just noticing that osd1 is my only OSD host that even has an enabled target for ceph-disk (ceph-disk@dev-sdb2.service).

roger@osd1:~$ systemctl list-units ceph*
UNIT                         LOAD   ACTIVE SUB     DESCRIPTION
● ceph-disk@dev-sdb2.service loaded failed failed  Ceph disk activation: /dev/sdb2
ceph-osd@3.service           loaded active running Ceph object storage daemon osd.3
ceph-mds.target              loaded active active  ceph target allowing to start/stop all ceph-mds@.service instances at once
ceph-mgr.target              loaded active active  ceph target allowing to start/stop all ceph-mgr@.service instances at once
ceph-mon.target              loaded active active  ceph target allowing to start/stop all ceph-mon@.service instances at once
ceph-osd.target              loaded active active  ceph target allowing to start/stop all ceph-osd@.service instances at once
ceph-radosgw.target          loaded active active  ceph target allowing to start/stop all ceph-radosgw@.service instances at once
ceph.target                  loaded active active  ceph target allowing to start/stop all ceph*@.service instances at once

roger@osd2:~$ systemctl list-units ceph*
UNIT                LOAD   ACTIVE SUB     DESCRIPTION
ceph-osd@4.service  loaded active running Ceph object storage daemon osd.4
ceph-mds.target     loaded active active  ceph target allowing to start/stop all ceph-mds@.service instances at once
ceph-mgr.target     loaded active active  ceph target allowing to start/stop all ceph-mgr@.service instances at once
ceph-mon.target     loaded active active  ceph target allowing to start/stop all ceph-mon@.service instances at once
ceph-osd.target     loaded active active  ceph target allowing to start/stop all ceph-osd@.service instances at once
ceph-radosgw.target loaded active active  ceph target allowing to start/stop all ceph-radosgw@.service instances at once
ceph.target         loaded active active  ceph target allowing to start/stop all ceph*@.service instances at once

roger@osd3:~$ systemctl list-units ceph*
UNIT                LOAD   ACTIVE SUB     DESCRIPTION
ceph-osd@0.service  loaded active running Ceph object storage daemon osd.0
ceph-mds.target     loaded active active  ceph target allowing to start/stop all ceph-mds@.service instances at once
ceph-mgr.target     loaded active active  ceph target allowing to start/stop all ceph-mgr@.service instances at once
ceph-mon.target     loaded active active  ceph target allowing to start/stop all ceph-mon@.service instances at once
ceph-osd.target     loaded active active  ceph target allowing to start/stop all ceph-osd@.service instances at once
ceph-radosgw.target loaded active active  ceph target allowing to start/stop all ceph-radosgw@.service instances at once
ceph.target         loaded active active  ceph target allowing to start/stop all ceph*@.service instances at once
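To dig into why an activation unit such as the failed ceph-disk@dev-sdb2.service above actually failed, the standard systemd tools are usually enough. A quick sketch, assuming journald still holds logs for the current boot:

    # List any failed ceph units on the host
    systemctl list-units --failed 'ceph*'

    # Status and the most recent log lines for the failed activation unit
    systemctl status ceph-disk@dev-sdb2.service

    # Full log for that unit from the current boot
    journalctl -b -u ceph-disk@dev-sdb2.service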
On Thu, Jul 20, 2017 at 10:23 AM Roger Brown <rogerpbr...@gmail.com> wrote:

I think I need help with some OSD trouble. OSD daemons on two hosts started flapping. At length, I rebooted host osd1 (osd.3), but the OSD daemon still fails to start.

Upon closer inspection, ceph-disk@dev-sdb2.service is failing to start due to "Error: /dev/sdb2 is not a block device". This is the command I see it failing to run:

    roger@osd1:~$ sudo /usr/sbin/ceph-disk --verbose activate-block /dev/sdb2
    Traceback (most recent call last):
      File "/usr/sbin/ceph-disk", line 9, in <module>
        load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
      File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5731, in run
        main(sys.argv[1:])
      File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5682, in main
        args.func(args)
      File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5438, in <lambda>
        func=lambda args: main_activate_space(name, args),
      File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4160, in main_activate_space
        osd_uuid = get_space_osd_uuid(name, dev)
      File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4115, in get_space_osd_uuid
        raise Error('%s is not a block device' % path)
    ceph_disk.main.Error: Error: /dev/sdb2 is not a block device

osd1 environment:

    $ ceph -v
    ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
    $ uname -r
    4.4.0-83-generic
    $ lsb_release -sc
    xenial

Please advise.
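Since the exception above is raised when ceph-disk decides the path is not a block device, it is worth confirming what /dev/sdb2 actually is at that moment. A minimal sketch, nothing Ceph-specific except the last command:

    # Does the kernel still see the partition, and is the node really a block device?
    lsblk /dev/sdb
    ls -l /dev/sdb2
    test -b /dev/sdb2 && echo "block device" || echo "NOT a block device"

    # What udev recorded for the node
    udevadm info --query=all --name=/dev/sdb2

    # What ceph-disk itself reports for the disks on this host
    sudo ceph-disk list

If the node is missing or is not a block device at that point, the same udev/partition-table timing issues discussed earlier in the thread (partprobe, udevadm trigger) are a plausible cause.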
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com