I have a 3-node 17.2.7 test cluster, and I decided to power down one of the
hosts, which contained a mon, 6 OSDs, and a standby mgr. Then I used
`ceph orch host rm <host> --offline --force` to remove the host (after
powering it down).
All of this looks as expected except for the logs after removing the host.
The pools have size 3 and min_size 2. I'm also unsure what would happen
even if I did successfully add the host back, since those objects would be
degraded. Besides the error from cephadm, what is Ceph thinking when I
pulled the plug and then removed that host? Am I guaranteed some kind of
recovery if I do successfully add that host back?
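For reference, the pool settings can be confirmed with `ceph osd pool ls
detail`, and my assumption is that bringing the host back would just be an
orch host add with the same name and address (as shown in the host listing
below):

# ceph osd pool ls detail | grep size
# ceph orch host add ceph-test-4 10.0.0.54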
----
Before removing host
root@ceph-test-2:/# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-test-2,ceph-test-3,ceph-test-4 (age 29m)
    mgr: ceph-test-2.vbjhdq(active, since 31m), standbys: ceph-test-4.jjubsa
    osd: 18 osds: 18 up (since 22m), 18 in (since 28m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   5.2 GiB used, 8.8 TiB / 8.8 TiB avail
    pgs:     129 active+clean
root@ceph-test-2:/# ceph orch host ls
HOST         ADDR       LABELS      STATUS
ceph-test-2  10.0.0.52  _admin,rgw
ceph-test-3  10.0.0.53
ceph-test-4  10.0.0.54
-------------
After removing host
# ceph orch host rm ceph-test-4 --offline --force
Removed offline host 'ceph-test-4'
# ceph orch ps
mon.ceph-test-4  ceph-test-4  stopped
osd.1            ceph-test-4  error
osd.7            ceph-test-4  error
....
# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         5.85956  root default
-5         2.92978      host ceph-test-2
 2    hdd  0.48830          osd.2             up   1.00000  1.00000
 4    hdd  0.48830          osd.4             up   1.00000  1.00000
 8    hdd  0.48830          osd.8             up   1.00000  1.00000
11    hdd  0.48830          osd.11            up   1.00000  1.00000
14    hdd  0.48830          osd.14            up   1.00000  1.00000
16    hdd  0.48830          osd.16            up   1.00000  1.00000
-3         2.92978      host ceph-test-3
 0    hdd  0.48830          osd.0             up   1.00000  1.00000
 3    hdd  0.48830          osd.3             up   1.00000  1.00000
 6    hdd  0.48830          osd.6             up   1.00000  1.00000
 9    hdd  0.48830          osd.9             up   1.00000  1.00000
12    hdd  0.48830          osd.12            up   1.00000  1.00000
15    hdd  0.48830          osd.15            up   1.00000  1.00000
root@ceph-test-2:/rootfs/root# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_WARN
            6 failed cephadm daemon(s)
            Degraded data redundancy: 145/669 objects degraded (21.674%),
            24 pgs degraded, 71 pgs undersized

  services:
    mon: 2 daemons, quorum ceph-test-2,ceph-test-3 (age 50m)
    mgr: ceph-test-2.vbjhdq(active, since 2h), standbys: ceph-test-3.wzmioq
    osd: 12 osds: 12 up (since 51m), 12 in (since 2h); 58 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   3.5 GiB used, 5.9 TiB / 5.9 TiB avail
    pgs:     145/669 objects degraded (21.674%)
             75/669 objects misplaced (11.211%)
             54 active+clean+remapped
             47 active+undersized
             24 active+undersized+degraded
             4  active+clean

  progress:
    Global Recovery Event (50m)
      [================............] (remaining: 37m)
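For what it's worth, the denominator checks out: 223 objects x 3 replicas =
669 object copies, of which 145 copies are currently degraded. The affected
PGs can be listed with something like:

# ceph health detail
# ceph pg ls degraded
# ceph pg ls undersized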
The cephadm logs show this as well:
2025-02-06T06:27:46.936+0000 7f85026a4700 -1 log_channel(cephadm) log [ERR] : auth get failed: failed to find osd.7 in keyring retval: -2
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1002, in _check_daemons
    self.mgr._daemon_action(daemon_spec, action=action)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2136, in _daemon_action
    daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 494, in generate_config
    extra_ceph_config=daemon_spec.ceph_conf)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 520, in get_config_and_keyring
    'entity': entity,
  File "/usr/share/ceph/mgr/mgr_module.py", line 1593, in check_mon_command
    raise MonCommandFailed(f'{cmd_dict["prefix"]} failed: {r.stderr} retval: {r.retval}')
mgr_module.MonCommandFailed: auth get failed: failed to find osd.7 in keyring retval: -2
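My read is that cephadm still has the six daemons from ceph-test-4 in its
cache and keeps trying to reconfigure them, even though their auth entries
(and the OSDs themselves, per the tree above) are already gone, hence the
ENOENT (-2). If I give up on the host rather than re-adding it, I assume
the stale entries could be dropped one at a time, e.g.:

# ceph orch daemon rm osd.1 --force

though I haven't verified that this works for daemons on a host the
orchestrator no longer knows about.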