Hi All,

We are seeing the same problem here at Rutherford Appleton Laboratory:

While patching our large physics data cluster against Stack Clash, we found that 
when a storage node reboots only about 8 of its 36 OSD disks remount 
automatically. We coaxed the remainder to mount during the reboot campaign (see 
the method below), but obviously we want a longer-term solution.

I believe this happens because many of the OSD daemons are started before their 
OSD disks have been mounted.
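
A quick way to confirm this after a reboot (a sketch only, assuming the stock 
ceph-disk@ and ceph-osd@ systemd units) is to list the failed activation units 
and grep the boot log:

systemctl list-units --state=failed 'ceph-disk@*' 'ceph-osd@*'   # units that lost the race
journalctl -b | grep -i ceph-disk                                # activation attempts this boot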

From: [ceph-users] erratic startup of OSDs at reboot time, 2017-07-12, Graham 
Allan
We tried running: “udevadm trigger --subsystem-match=block --action=add” with 
occasional success but this wasn’t reliable.

From: [ceph-users] CentOS7 Mounting Problem, 2017-04-10, Jake Young
Interestingly, running partprobe causes the OSD disks to mount and the OSDs to 
start automatically; however, I don't see why this would fix the problem for 
subsequent reboots.
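
Putting those two reports together, a post-boot re-trigger along the following 
lines might be worth trying (a sketch only; whether "ceph-disk activate-all" is 
appropriate depends on how the OSDs were prepared):

udevadm trigger --subsystem-match=block --action=add   # replay 'add' events for block devices
partprobe                                              # have the kernel re-read partition tables
ceph-disk activate-all                                 # activate any prepared but not yet mounted OSD partitions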

Note: interestingly, I have one storage node of this model (36 OSDs per host) in 
our development cluster (78 OSDs), and over 5 reboots all of its OSD disks 
mounted and the OSD processes started, so I am unable to reproduce the problem 
at small scale.

Best wishes,
Bruno

--------

Cluster:
5 MONs
1404 OSDs
39 storage nodes, 36 OSD disks per node attached via a PCI HBA

Software:
OS: SL7x
Ceph Release: kraken
Ceph Version: 11.2.0-0
Ceph Deploy Release: kraken
Ceph Deploy Version: 1.5.37-0

OSDs created as follows:
ceph-deploy disk zap $sn_fqdn:sdb
ceph-deploy --overwrite-conf config pull $sn_fqdn
ceph-deploy osd prepare $sn_fqdn:sdb

Coaxing method:
# Start every loaded ceph-disk service unit, one per second, logging each start
# with a timestamp (ts is from moreutils). Failed units are prefixed with '●'
# in the list-units output, so the unit name is the second awk field.
for srv in $(systemctl list-units -t service --full --no-pager -n0 | grep ceph-disk | awk '{print $2}'); do
    echo "Starting $srv" | ts
    systemctl start "$srv"
    sleep 1
done
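
To make the coaxing persist across reboots until a proper fix is found, one 
stopgap (a sketch only; the unit and script names below are hypothetical) would 
be to save the loop above as a script and run it from a oneshot service late in 
boot:

# Hypothetical unit name and script path, shown for illustration only.
cat > /etc/systemd/system/ceph-disk-coax.service <<'EOF'
[Unit]
Description=Retry activation of Ceph OSD disks that failed to mount at boot
After=ceph.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ceph-disk-coax.sh

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable ceph-disk-coax.service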


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Willem 
Jan Withagen
Sent: 20 July 2017 19:06
To: Roger Brown; ceph-users
Subject: Re: [ceph-users] ceph-disk activate-block: not a block device

Hi Roger,

Device detection has recently changed (because FreeBSD does not have block 
devices), so it could very well be that this is an actual problem where 
something is still wrong.
Please keep an eye out, and let me know if it comes back.

--WjW
On 20-07-2017 at 19:29, Roger Brown wrote:
So I disabled ceph-disk and will chalk it up as a red herring to ignore.
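
For reference, one way to do that (a sketch only, and only a guess at what 
"disabled" means here) is to stop and mask the failing activation unit so it is 
not started again:

systemctl stop ceph-disk@dev-sdb2.service
systemctl mask ceph-disk@dev-sdb2.service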


On Thu, Jul 20, 2017 at 11:02 AM Roger Brown <rogerpbr...@gmail.com> wrote:
Also I'm just noticing osd1 is my only OSD host that even has a ceph-disk unit 
loaded (ceph-disk@dev-sdb2.service).

roger@osd1:~$ systemctl list-units ceph*
  UNIT                        LOAD   ACTIVE SUB     DESCRIPTION
● ceph-disk@dev-sdb2.service  loaded failed failed  Ceph disk activation: /dev/sdb2
  ceph-osd@3.service          loaded active running Ceph object storage daemon osd.3
  ceph-mds.target             loaded active active  ceph target allowing to start/stop all ceph-mds@.service instances at once
  ceph-mgr.target             loaded active active  ceph target allowing to start/stop all ceph-mgr@.service instances at once
  ceph-mon.target             loaded active active  ceph target allowing to start/stop all ceph-mon@.service instances at once
  ceph-osd.target             loaded active active  ceph target allowing to start/stop all ceph-osd@.service instances at once
  ceph-radosgw.target         loaded active active  ceph target allowing to start/stop all ceph-radosgw@.service instances at once
  ceph.target                 loaded active active  ceph target allowing to start/stop all ceph*@.service instances at once

roger@osd2:~$ systemctl list-units ceph*
UNIT                 LOAD   ACTIVE SUB     DESCRIPTION
ceph-osd@4.service   loaded active running Ceph object storage daemon osd.4
ceph-mds.target      loaded active active  ceph target allowing to start/stop all ceph-mds@.service instances at once
ceph-mgr.target      loaded active active  ceph target allowing to start/stop all ceph-mgr@.service instances at once
ceph-mon.target      loaded active active  ceph target allowing to start/stop all ceph-mon@.service instances at once
ceph-osd.target      loaded active active  ceph target allowing to start/stop all ceph-osd@.service instances at once
ceph-radosgw.target  loaded active active  ceph target allowing to start/stop all ceph-radosgw@.service instances at once
ceph.target          loaded active active  ceph target allowing to start/stop all ceph*@.service instances at once

roger@osd3:~$ systemctl list-units ceph*
UNIT                 LOAD   ACTIVE SUB     DESCRIPTION
ceph-osd@0.service   loaded active running Ceph object storage daemon osd.0
ceph-mds.target      loaded active active  ceph target allowing to start/stop all ceph-mds@.service instances at once
ceph-mgr.target      loaded active active  ceph target allowing to start/stop all ceph-mgr@.service instances at once
ceph-mon.target      loaded active active  ceph target allowing to start/stop all ceph-mon@.service instances at once
ceph-osd.target      loaded active active  ceph target allowing to start/stop all ceph-osd@.service instances at once
ceph-radosgw.target  loaded active active  ceph target allowing to start/stop all ceph-radosgw@.service instances at once
ceph.target          loaded active active  ceph target allowing to start/stop all ceph*@.service instances at once


On Thu, Jul 20, 2017 at 10:23 AM Roger Brown <rogerpbr...@gmail.com> wrote:
I think I need help with some OSD trouble. OSD daemons on two hosts started 
flapping. At length, I rebooted host osd1 (osd.3), but the OSD daemon still 
fails to start. Upon closer inspection, ceph-disk@dev-sdb2.service is failing 
to start with "Error: /dev/sdb2 is not a block device".

This is the command I see it failing to run:

roger@osd1:~$ sudo /usr/sbin/ceph-disk --verbose activate-block /dev/sdb2
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 9, in <module>
    load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5731, in run
    main(sys.argv[1:])
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5682, in main
    args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5438, in 
<lambda>
    func=lambda args: main_activate_space(name, args),
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4160, in 
main_activate_space
    osd_uuid = get_space_osd_uuid(name, dev)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4115, in 
get_space_osd_uuid
    raise Error('%s is not a block device' % path)
ceph_disk.main.Error: Error: /dev/sdb2 is not a block device
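
That error is raised when the path handed to activate-block does not look like a 
block device. A quick sanity check from the shell (a sketch only, independent of 
ceph-disk itself):

ls -l /dev/sdb2     # a block device shows type 'b' in the first column
test -b /dev/sdb2 && echo "block device" || echo "NOT a block device"
lsblk /dev/sdb      # confirm the kernel still sees the partition at all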

osd1 environment:
$ ceph -v
ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
$ uname -r
4.4.0-83-generic
$ lsb_release -sc
xenial

Please advise.





_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
