I’ve had a similar experience with Reef, trying to destroy an improperly deployed OSD on a viable drive. I had to run `ceph-volume lvm zap` to get past the purge, and at no point was the OSD marked as destroyed.
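For anyone hitting the same "done, waiting for purge" state, here is a minimal sketch of the steps discussed further down in this thread. The OSD ID (253), hostname (dig-osd4) and device path (/dev/sdX) are only the examples from this thread and need to be adapted to your cluster:

    # Take the OSD out of the cephadm removal queue so it stops retrying the zap
    ceph orch osd rm stop 253

    # Zap the old LVs by hand, either directly on the OSD host...
    cephadm ceph-volume lvm zap --destroy /dev/sdX
    # ...or through the orchestrator
    ceph orch device zap dig-osd4 /dev/sdX --force

    # Then either keep the OSD ID reserved for the replacement drive...
    ceph osd destroy 253 --yes-i-really-mean-it
    # ...or remove the OSD from the CRUSH map entirely
    ceph osd purge 253 --yes-i-really-mean-it

Whether destroy (which preserves the ID for the replacement) or purge (which removes the OSD completely) is the right last step depends on whether the slot will be reused.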
> On Apr 30, 2025, at 6:15 AM, Eugen Block <ebl...@nde.ag> wrote:
>
> Right, 'ceph osd destroy' most likely won't help you here. The --replace
> flag is only there to mark an OSD as destroyed (so it will reuse its ID
> after replacing the drive).
> You wrote that stopping osd rm for 253 unblocked the upgrade, so the
> cluster is currently upgrading?
>
> To clear the pending state, I would stop rm for the other OSD as well,
> since it's already out and down anyway. You can always zap a drive, either
> directly on the host with:
>
> cephadm ceph-volume lvm zap --destroy /dev/sdX
>
> Or using the orchestrator:
>
> orch device zap <hostname> <path> [--force]
>
> But just to clarify, OSD.381 is already the replacement disk for a
> previously failed drive? If you zap it, the orchestrator would try to
> apply any matching spec and create a new OSD, probably with ID 381 again.
>
> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>
>> Frédéric,
>>
>> My situation is a bit different, I think. I had two malfunctioning OSDs
>> that I removed with `ceph orch osd rm --replace --zap`: one was really
>> dead and no longer seen by the OS (osd.253), and the other one had a lot
>> of HW errors but was still there (osd.381). Both have been successfully
>> marked as destroyed in the CRUSH map. I just didn't realize that cephadm
>> was retrying every 10s to zap osd.253, getting an error as the disk could
>> not be found. Looking at the removal status this morning (the removal was
>> done ~2 weeks ago) with 'ceph orch osd rm status', I got:
>>
>> OSD  HOST      STATE                    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
>> 253  dig-osd4  done, waiting for purge  0    True     False  True
>> 381  dig-osd6  done, waiting for purge  0    True     False  False  2025-04-23 11:56:09.864724+00:00
>>
>> I don't know what the status "waiting for purge" means... but we can see
>> that cephadm considers that the drain never started for osd.253, as the
>> device was unavailable I guess... What happens with dig-osd6 is less
>> clear to me, but it may be a consequence of the disk freed by the initial
>> rm being picked up by cephadm and re-added as the replacement OSD,
>> because we forgot to set the osd.all-available-devices service to
>> unmanaged. The drain started on Apr 23 is the second rm I did after
>> fixing the osd.all-available-devices service. For this second attempt, I
>> didn't specify --zap, not sure why (a mistake!).
>>
>> I have the feeling, but I may be wrong, that 'ceph osd destroy' will not
>> help as they are already marked destroyed in the CRUSH map...
>>
>> I'm wondering whether I should do 'ceph orch osd rm stop 381' as I did
>> for 253, or whether it will impact the replacement later. Or, said in a
>> different way, is the replace flag something managed by cephadm that
>> requires the OSD to stay in the "rm queue" until the replacement is done?
>>
>> Best regards,
>>
>> Michel
>>
>>> Le 30/04/2025 à 10:50, Frédéric Nass a écrit :
>>> Hi Michel,
>>>
>>> I've seen this recently on Reef (OSD stuck in the rm queue with the
>>> orchestrator trying to zap a device that had already been zapped).
>>>
>>> I could reproduce this a few times by deleting a batch of OSDs running
>>> on the same node. The whole 'ceph orch osd rm' process would stop
>>> progressing when trying to remove the ~8th OSD. I suspect that
>>> ceph-volume or the orchestrator is misinformed at some point that the
>>> device has already been zapped, looping over and over trying to remove
>>> this device that doesn't exist anymore.
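When the orchestrator is stuck in this kind of retry loop, the queue it is working from can be inspected and, if need be, the offending entry stopped. A short sketch using only commands already mentioned in this thread, with osd.253 as the example ID:

    # Show what cephadm still has queued, including the replace/force/zap flags
    ceph orch osd rm status

    # The same queue as raw JSON, straight from the mgr key/value store
    ceph config-key get mgr/cephadm/osd_remove_queue | jq

    # Take the entry out of the queue so the zap retries stop
    ceph orch osd rm stop 253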
>>>
>>> I think you should now run 'ceph osd destroy <OSD_ID>
>>> --yes-i-really-mean-it'.
>>>
>>> Regards,
>>> Frédéric.
>>>
>>> ----- Le 30 Avr 25, à 10:28, Michel Jouvin michel.jou...@ijclab.in2p3.fr
>>> a écrit :
>>>
>>>> Eugen,
>>>>
>>>> Thanks, I forgot that operations started with the orchestrator can be
>>>> stopped. You were right: stopping the 'osd rm' was enough to unblock
>>>> the upgrade. I am not completely sure what the consequence is for the
>>>> replace flag: I have the feeling it has been lost somehow, as the OSD
>>>> is no longer listed by 'ceph orch osd rm status' and 'ceph -s' now
>>>> reports one OSD down and 1 stray daemon instead of 2 stray daemons.
>>>>
>>>> Michel
>>>>
>>>> Le 30/04/2025 à 09:24, Eugen Block a écrit :
>>>>> You can stop the osd removal:
>>>>>
>>>>> ceph orch osd rm stop <OSD_ID>
>>>>>
>>>>> I'm not entirely sure what the orchestrator will do except for
>>>>> clearing the pending state, and since the OSDs are already marked as
>>>>> destroyed in the crush tree, I wouldn't expect anything weird. But
>>>>> it's worth a try, I guess.
>>>>>
>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I had no time to really investigate our problem further yesterday.
>>>>>> But I realized one issue that may explain the problem with osd.253:
>>>>>> the underlying disk is so dead that it is no longer visible to the
>>>>>> OS. I probably added --zap when I did the 'ceph orch osd rm', and
>>>>>> thus it is trying to do the zapping, fails as it doesn't find the
>>>>>> disk, and retries indefinitely... I remain a little surprised that
>>>>>> this zapping error is not reported (without the traceback) at the
>>>>>> INFO level and requires DEBUG to be seen, but that is a detail. I'm
>>>>>> surprised that Ceph is not giving up on zapping if it cannot access
>>>>>> the device; or did I miss something and there is a way to stop this
>>>>>> process?
>>>>>>
>>>>>> Maybe it is a corner case that has been fixed/improved since
>>>>>> 18.2.2... Anyway, the question remains: is there a way out of this
>>>>>> problem (which seems to be the only reason the upgrade is not really
>>>>>> starting) apart from getting the replacement device?
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Michel
>>>>>>
>>>>>> Le 28/04/2025 à 18:19, Michel Jouvin a écrit :
>>>>>>> Hi Frédéric,
>>>>>>>
>>>>>>> Thanks for the command. I'm always looking at the wrong page of the
>>>>>>> doc! I looked at
>>>>>>> https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/
>>>>>>> which lists the Ceph subsystems and their default log levels, but
>>>>>>> there is no mention of cephadm there... After enabling the cephadm
>>>>>>> debug log level and restarting the upgrade, I got the messages
>>>>>>> below. The only strange thing points to the problem with osd.253,
>>>>>>> where it tries to zap the device that was probably already zapped
>>>>>>> and thus cannot find the LV associated with osd.253. There aren't
>>>>>>> really any other messages indicating the impact on the upgrade, but
>>>>>>> I guess it is the reason. What do you think? And is there any way to
>>>>>>> fix it, other than replacing the OSD?
>>>>>>> >>>>>>> Best regards, >>>>>>> >>>>>>> Michel >>>>>>> >>>>>>> --------------------- cephadm debug level log ------------------------- >>>>>>> >>>>>>> 2025-04-28T17:32:12.713746+0200 mgr.dig-mon1.fownxo [INF] Upgrade: >>>>>>> Started with target quay.io/ceph/ceph:v18.2.6 >>>>>>> 2025-04-28T17:32:14.822030+0200 mgr.dig-mon1.fownxo [DBG] Refreshed >>>>>>> host dig-osd4 devices (23) >>>>>>> 2025-04-28T17:32:14.822550+0200 mgr.dig-mon1.fownxo [DBG] Finding >>>>>>> OSDSpecs for host: <dig-osd4> >>>>>>> 2025-04-28T17:32:14.822614+0200 mgr.dig-mon1.fownxo [DBG] Generating >>>>>>> OSDSpec previews for [] >>>>>>> 2025-04-28T17:32:14.822695+0200 mgr.dig-mon1.fownxo [DBG] Loading >>>>>>> OSDSpec previews to HostCache for host <dig-osd4> >>>>>>> 2025-04-28T17:32:14.985257+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'config generate-minimal-conf' -> 0 in 0.005s >>>>>>> 2025-04-28T17:32:15.262102+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'auth get' -> 0 in 0.277s >>>>>>> 2025-04-28T17:32:15.262751+0200 mgr.dig-mon1.fownxo [DBG] Combine >>>>>>> hosts with existing daemons [] + new hosts.... (very long line) >>>>>>> >>>>>>> 2025-04-28T17:32:15.416491+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> _update_paused_health >>>>>>> 2025-04-28T17:32:17.314607+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd df' -> 0 in 0.064s >>>>>>> 2025-04-28T17:32:17.637526+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd df' -> 0 in 0.320s >>>>>>> 2025-04-28T17:32:17.645703+0200 mgr.dig-mon1.fownxo [DBG] 2 OSDs are >>>>>>> scheduled for removal: [osd.381, osd.253] >>>>>>> 2025-04-28T17:32:17.661910+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd df' -> 0 in 0.011s >>>>>>> 2025-04-28T17:32:17.667068+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s >>>>>>> 2025-04-28T17:32:17.667117+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>>>>> safe-to-destroy returns: >>>>>>> 2025-04-28T17:32:17.667164+0200 mgr.dig-mon1.fownxo [DBG] running >>>>>>> cmd: osd down on ids [osd.381] >>>>>>> 2025-04-28T17:32:17.667854+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd down' -> 0 in 0.001s >>>>>>> 2025-04-28T17:32:17.667908+0200 mgr.dig-mon1.fownxo [INF] osd.381 >>>>>>> now down >>>>>>> 2025-04-28T17:32:17.668446+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>> osd.381 on dig-osd6 was already removed >>>>>>> 2025-04-28T17:32:17.669534+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s >>>>>>> 2025-04-28T17:32:17.669675+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>>>>> destroy-actual returns: >>>>>>> 2025-04-28T17:32:17.669789+0200 mgr.dig-mon1.fownxo [INF] >>>>>>> Successfully destroyed old osd.381 on dig-osd6; ready for replacement >>>>>>> 2025-04-28T17:32:17.669874+0200 mgr.dig-mon1.fownxo [DBG] Removing >>>>>>> osd.381 from the queue. 
>>>>>>> 2025-04-28T17:32:17.680411+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd df' -> 0 in 0.010s >>>>>>> 2025-04-28T17:32:17.685141+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s >>>>>>> 2025-04-28T17:32:17.685190+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>>>>> safe-to-destroy returns: >>>>>>> 2025-04-28T17:32:17.685234+0200 mgr.dig-mon1.fownxo [DBG] running >>>>>>> cmd: osd down on ids [osd.253] >>>>>>> 2025-04-28T17:32:17.685710+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd down' -> 0 in 0.000s >>>>>>> 2025-04-28T17:32:17.685759+0200 mgr.dig-mon1.fownxo [INF] osd.253 >>>>>>> now down >>>>>>> 2025-04-28T17:32:17.686186+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>> osd.253 on dig-osd4 was already removed >>>>>>> 2025-04-28T17:32:17.687068+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s >>>>>>> 2025-04-28T17:32:17.687102+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>>>>> destroy-actual returns: >>>>>>> 2025-04-28T17:32:17.687141+0200 mgr.dig-mon1.fownxo [INF] >>>>>>> Successfully destroyed old osd.253 on dig-osd4; ready for replacement >>>>>>> 2025-04-28T17:32:17.687176+0200 mgr.dig-mon1.fownxo [INF] Zapping >>>>>>> devices for osd.253 on dig-osd4 >>>>>>> 2025-04-28T17:32:17.687508+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> _run_cephadm : command = ceph-volume >>>>>>> 2025-04-28T17:32:17.687554+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> _run_cephadm : args = ['--', 'lvm', 'zap', '--osd-id', '253', >>>>>>> '--destroy'] >>>>>>> 2025-04-28T17:32:17.687637+0200 mgr.dig-mon1.fownxo [DBG] osd >>>>>>> container image >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> 2025-04-28T17:32:17.687677+0200 mgr.dig-mon1.fownxo [DBG] args: >>>>>>> --image >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> --timeout 895 ceph-volume --fsid >>>>>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy >>>>>>> 2025-04-28T17:32:17.687733+0200 mgr.dig-mon1.fownxo [DBG] Running >>>>>>> command: which python3 >>>>>>> 2025-04-28T17:32:17.731474+0200 mgr.dig-mon1.fownxo [DBG] Running >>>>>>> command: /usr/bin/python3 >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d >>>>>>> --image >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> --timeout 895 ceph-volume --fsid >>>>>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy >>>>>>> 2025-04-28T17:32:20.406723+0200 mgr.dig-mon1.fownxo [DBG] code: 1 >>>>>>> 2025-04-28T17:32:20.406764+0200 mgr.dig-mon1.fownxo [DBG] err: >>>>>>> Inferring config >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/config/ceph.conf >>>>>>> Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host >>>>>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume >>>>>>> --privileged --group-add=disk --init -e >>>>>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e >>>>>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v >>>>>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z >>>>>>> -v >>>>>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z >>>>>>> -v >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z 
>>>>>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v >>>>>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v >>>>>>> /run/lock/lvm:/run/lock/lvm -v >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro >>>>>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v >>>>>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> lvm zap --osd-id 253 --destroy >>>>>>> /usr/bin/podman: stderr Traceback (most recent call last): >>>>>>> /usr/bin/podman: stderr File "/usr/sbin/ceph-volume", line 11, in >>>>>>> <module> >>>>>>> /usr/bin/podman: stderr load_entry_point('ceph-volume==1.0.0', >>>>>>> 'console_scripts', 'ceph-volume')() >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in >>>>>>> __init__ >>>>>>> /usr/bin/podman: stderr self.main(self.argv) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line >>>>>>> 59, in newfunc >>>>>>> /usr/bin/podman: stderr return f(*a, **kw) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in >>>>>>> main >>>>>>> /usr/bin/podman: stderr terminal.dispatch(self.mapper, >>>>>>> subcommand_args) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line >>>>>>> 194, in dispatch >>>>>>> /usr/bin/podman: stderr instance.main() >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", >>>>>>> line 46, in main >>>>>>> /usr/bin/podman: stderr terminal.dispatch(self.mapper, self.argv) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line >>>>>>> 194, in dispatch >>>>>>> /usr/bin/podman: stderr instance.main() >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>>>>> line 403, in main >>>>>>> /usr/bin/podman: stderr self.zap_osd() >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line >>>>>>> 16, in is_root >>>>>>> /usr/bin/podman: stderr return func(*a, **kw) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>>>>> line 301, in zap_osd >>>>>>> /usr/bin/podman: stderr devices = >>>>>>> find_associated_devices(self.args.osd_id, self.args.osd_fsid) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>>>>> line 88, in find_associated_devices >>>>>>> /usr/bin/podman: stderr '%s' % osd_id or osd_fsid) >>>>>>> /usr/bin/podman: stderr RuntimeError: Unable to find any LV for >>>>>>> zapping OSD: 253 >>>>>>> Traceback (most recent call last): >>>>>>> File "/usr/lib64/python3.9/runpy.py", line 197, in >>>>>>> _run_module_as_main >>>>>>> return _run_code(code, main_globals, None, >>>>>>> File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code >>>>>>> exec(code, run_globals) >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 10700, in <module> >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 10688, in main >>>>>>> File 
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2445, in _infer_config >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2361, in _infer_fsid >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2473, in _infer_image >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2348, in _validate_fsid >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 6970, in command_ceph_volume >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2136, in call_throws >>>>>>> RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host >>>>>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume >>>>>>> --privileged --group-add=disk --init -e >>>>>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e >>>>>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v >>>>>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z >>>>>>> -v >>>>>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z >>>>>>> -v >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z >>>>>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v >>>>>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v >>>>>>> /run/lock/lvm:/run/lock/lvm -v >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro >>>>>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v >>>>>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> lvm zap --osd-id 253 --destroy >>>>>>> 2025-04-28T17:32:20.409316+0200 mgr.dig-mon1.fownxo [DBG] serve loop >>>>>>> sleep >>>>>>> >>>>>>> ----------------------- >>>>>>> >>>>>>> >>>>>>> Le 28/04/2025 à 14:00, Frédéric Nass a écrit : >>>>>>>> Hi Michel, >>>>>>>> >>>>>>>> You need to turn on cephadm debugging as described here [1] in the >>>>>>>> documentation >>>>>>>> >>>>>>>> $ ceph config set mgr mgr/cephadm/log_to_cluster_level debug >>>>>>>> >>>>>>>> and then look for any hints with >>>>>>>> >>>>>>>> $ ceph -W cephadm --watch-debug >>>>>>>> >>>>>>>> or >>>>>>>> >>>>>>>> $ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log (on the >>>>>>>> active MGR) >>>>>>>> >>>>>>>> when you start/stop the upgrade. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Frédéric. >>>>>>>> >>>>>>>> [1] https://docs.ceph.com/en/reef/cephadm/operations/ >>>>>>>> >>>>>>>> ----- Le 28 Avr 25, à 12:52, Michel Jouvin >>>>>>>> michel.jou...@ijclab.in2p3.fr a écrit : >>>>>>>> >>>>>>>>> Eugen, >>>>>>>>> >>>>>>>>> Thanks for doing the test. 
I scanned all logs and cannot find >>>>>>>>> anything >>>>>>>>> except the message mentioned displayed every 10s about the removed >>>>>>>>> OSDs >>>>>>>>> that led me to think there is something not exactly as expected... >>>>>>>>> No clue >>>>>>>>> what... >>>>>>>>> >>>>>>>>> Michel >>>>>>>>> Sent from my mobile >>>>>>>>> Le 28 avril 2025 12:43:23 Eugen Block <ebl...@nde.ag> a écrit : >>>>>>>>> >>>>>>>>>> I just tried this on a single-node virtual test cluster, deployed it >>>>>>>>>> with 18.2.2. Then I removed one OSD with --replace flag (no --zap, >>>>>>>>>> otherwise it would redeploy the OSD on that VM). Then I also see the >>>>>>>>>> stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished >>>>>>>>>> successfully. That's why I don't think the stray daemon is the root >>>>>>>>>> cause here. I would suggest to scan monitor and cephadm logs as >>>>>>>>>> well. >>>>>>>>>> After the upgrade to 18.2.6 the stray warning cleared, btw. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>> >>>>>>>>>>> Eugen, >>>>>>>>>>> >>>>>>>>>>> As said in a previous message, I found a tracker issue with a >>>>>>>>>>> similar problem: https://tracker.ceph.com/issues/67018, even if the >>>>>>>>>>> cause may be different as it is in older versions than me. For some >>>>>>>>>>> reasons the sequence of messages every 10s is now back on the 2 >>>>>>>>>>> OSDs: >>>>>>>>>>> >>>>>>>>>>> 2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>>>>> osd.253 now down >>>>>>>>>>> 2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>>>>>> osd.253 on dig-osd4 was already removed >>>>>>>>>>> 2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>>>>> Successfully destroyed old osd.253 on dig-osd4; ready for >>>>>>>>>>> replacement >>>>>>>>>>> 2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping >>>>>>>>>>> devices for osd.253 on dig-osd4 >>>>>>>>>>> 2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>>>>> osd.381 now down >>>>>>>>>>> 2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>>>>>> osd.381 on dig-osd6 was already removed >>>>>>>>>>> 2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>>>>> Successfully destroyed old osd.381 on dig-osd6; ready for >>>>>>>>>>> replacement >>>>>>>>>>> >>>>>>>>>>> except that the "Zapping.." message is not present for the >>>>>>>>>>> second OSD... >>>>>>>>>>> >>>>>>>>>>> I tried to increase the mgr log verbosity with 'ceph tell >>>>>>>>>>> mgr.dig-mon1.fownxo config set debug_mgr 20/20' and there >>>>>>>>>>> stop/start >>>>>>>>>>> the upgrade without any additonal message displayed. >>>>>>>>>>> >>>>>>>>>>> Michel >>>>>>>>>>> >>>>>>>>>>> Le 28/04/2025 à 09:20, Eugen Block a écrit : >>>>>>>>>>>> Have you increased the debug level for the mgr? It would surprise >>>>>>>>>>>> me if stray daemons would really block an upgrade. But debug logs >>>>>>>>>>>> might reveal something. And if it can be confirmed that the strays >>>>>>>>>>>> really block the upgrade, you could either remove the OSDs >>>>>>>>>>>> entirely >>>>>>>>>>>> (they are already drained) to continue upgrading, or create a >>>>>>>>>>>> tracker issue to report this and wait for instructions. >>>>>>>>>>>> >>>>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Eugen, >>>>>>>>>>>>> >>>>>>>>>>>>> Yes I stopped and restarted the upgrade several times already, in >>>>>>>>>>>>> particular after failing over the mgr. 
And the only messages >>>>>>>>>>>>> related are the upgrade started and upgrade canceled ones. >>>>>>>>>>>>> Nothing >>>>>>>>>>>>> related to an error or a crash... >>>>>>>>>>>>> >>>>>>>>>>>>> For me the question is why do I have stray daemons after removing >>>>>>>>>>>>> OSD. IMO it is unexpected as these daemons are not there anymore. >>>>>>>>>>>>> I can understand that stray daemons prevent the upgrade to start >>>>>>>>>>>>> if they are really strayed... And it would be nice if cephadm was >>>>>>>>>>>>> giving a message about why the upgrade does not really start >>>>>>>>>>>>> despite its status is "in progress"... >>>>>>>>>>>>> >>>>>>>>>>>>> Best regards, >>>>>>>>>>>>> >>>>>>>>>>>>> Michel >>>>>>>>>>>>> Sent from my mobile >>>>>>>>>>>>> Le 28 avril 2025 07:27:44 Eugen Block <ebl...@nde.ag> a écrit : >>>>>>>>>>>>> >>>>>>>>>>>>>> Do you see anything in the mgr log? To get fresh logs I would >>>>>>>>>>>>>> cancel >>>>>>>>>>>>>> the upgrade (ceph orch upgrade stop) and then try again. >>>>>>>>>>>>>> A workaround could be to manually upgrade the mgr daemons by >>>>>>>>>>>>>> changing >>>>>>>>>>>>>> their unit.run file, but that would be my last resort. Btwm >>>>>>>>>>>>>> did you >>>>>>>>>>>>>> stop and start the upgrade after failing the mgr as well? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Eugen, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for the hint. Here is the osd_remove_queue: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph config-key get >>>>>>>>>>>>>>> mgr/cephadm/osd_remove_queue|jq >>>>>>>>>>>>>>> [ >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> "osd_id": 253, >>>>>>>>>>>>>>> "started": true, >>>>>>>>>>>>>>> "draining": false, >>>>>>>>>>>>>>> "stopped": false, >>>>>>>>>>>>>>> "replace": true, >>>>>>>>>>>>>>> "force": false, >>>>>>>>>>>>>>> "zap": true, >>>>>>>>>>>>>>> "hostname": "dig-osd4", >>>>>>>>>>>>>>> "drain_started_at": null, >>>>>>>>>>>>>>> "drain_stopped_at": null, >>>>>>>>>>>>>>> "drain_done_at": "2025-04-15T14:09:30.521534Z", >>>>>>>>>>>>>>> "process_started_at": "2025-04-15T14:09:14.091592Z" >>>>>>>>>>>>>>> }, >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> "osd_id": 381, >>>>>>>>>>>>>>> "started": true, >>>>>>>>>>>>>>> "draining": false, >>>>>>>>>>>>>>> "stopped": false, >>>>>>>>>>>>>>> "replace": true, >>>>>>>>>>>>>>> "force": false, >>>>>>>>>>>>>>> "zap": false, >>>>>>>>>>>>>>> "hostname": "dig-osd6", >>>>>>>>>>>>>>> "drain_started_at": "2025-04-23T11:56:09.864724Z", >>>>>>>>>>>>>>> "drain_stopped_at": null, >>>>>>>>>>>>>>> "drain_done_at": "2025-04-25T06:53:03.678729Z", >>>>>>>>>>>>>>> "process_started_at": "2025-04-23T11:56:05.924923Z" >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> ] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It is not empty the two stray daemons are listed. Not sure >>>>>>>>>>>>>>> it these >>>>>>>>>>>>>>> entries are expected as I specified --replace... A similar >>>>>>>>>>>>>>> issue was >>>>>>>>>>>>>>> reported in https://tracker.ceph.com/issues/67018 so before >>>>>>>>>>>>>>> Reef but >>>>>>>>>>>>>>> the cause may be different. Still not clear for me how to >>>>>>>>>>>>>>> get out of >>>>>>>>>>>>>>> this, except may be replacing the OSDs but this will take >>>>>>>>>>>>>>> some time... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Michel >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Le 27/04/2025 à 10:21, Eugen Block a écrit : >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> what's the current ceph status? 
Wasn't there a bug in early >>>>>>>>>>>>>>>> Reef >>>>>>>>>>>>>>>> versions preventing upgrades if there were removed OSDs in the >>>>>>>>>>>>>>>> queue? But IIRC, the cephadm module would crash. Can you check >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ceph config-key get mgr/cephadm/osd_remove_queue >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> And then I would check the mgr log, maybe set it to a >>>>>>>>>>>>>>>> higher debug >>>>>>>>>>>>>>>> level to see what's blocking it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I tried to restart all the mgrs (we have 3, 1 active, 2 >>>>>>>>>>>>>>>>> standby) >>>>>>>>>>>>>>>>> by executing 3 times the `ceph mgr fail`, no impact. I don't >>>>>>>>>>>>>>>>> really understand why I get these stray daemons after doing a >>>>>>>>>>>>>>>>> 'ceph orch osd rm --replace` but I think I have always >>>>>>>>>>>>>>>>> seen this. >>>>>>>>>>>>>>>>> I tried to mute rather than disable the stray daemon check >>>>>>>>>>>>>>>>> but it >>>>>>>>>>>>>>>>> doesn't help either. And I find strange this message every >>>>>>>>>>>>>>>>> 10s >>>>>>>>>>>>>>>>> about one of the destroyed OSD and only one, reporting it >>>>>>>>>>>>>>>>> is down >>>>>>>>>>>>>>>>> and already destroyed and saying it'll zap it (I think I >>>>>>>>>>>>>>>>> didn't >>>>>>>>>>>>>>>>> add --zap when I removed it as the underlying disk is dead). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm completely stuck with this upgrade and I don't >>>>>>>>>>>>>>>>> remember having >>>>>>>>>>>>>>>>> this kind of problems in previous upgrades with cephadm... >>>>>>>>>>>>>>>>> Any >>>>>>>>>>>>>>>>> idea where to look for the cause and/or how to fix it? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Michel >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Le 24/04/2025 à 23:34, Michel Jouvin a écrit : >>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm trying to upgrade a (cephadm) cluster from 18.2.2 to >>>>>>>>>>>>>>>>>> 18.2.6, >>>>>>>>>>>>>>>>>> using 'ceph orch upgrade'. When I enter the command 'ceph >>>>>>>>>>>>>>>>>> orch >>>>>>>>>>>>>>>>>> upgrade start --ceph-version 18.2.6', I receive a message >>>>>>>>>>>>>>>>>> saying >>>>>>>>>>>>>>>>>> that the upgrade has been initiated, with a similar >>>>>>>>>>>>>>>>>> message in >>>>>>>>>>>>>>>>>> the logs but nothing happens after this. 'ceph orch upgrade >>>>>>>>>>>>>>>>>> status' says: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph orch upgrade status >>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>> "target_image": "quay.io/ceph/ceph:v18.2.6", >>>>>>>>>>>>>>>>>> "in_progress": true, >>>>>>>>>>>>>>>>>> "which": "Upgrading all daemon types on all hosts", >>>>>>>>>>>>>>>>>> "services_complete": [], >>>>>>>>>>>>>>>>>> "progress": "", >>>>>>>>>>>>>>>>>> "message": "", >>>>>>>>>>>>>>>>>> "is_paused": false >>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>> ------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The first time I entered the command, the cluster status was >>>>>>>>>>>>>>>>>> HEALTH_WARN because of 2 stray daemons (caused by >>>>>>>>>>>>>>>>>> destroyed OSDs, >>>>>>>>>>>>>>>>>> rm --replace). I set mgr/cephadm/warn_on_stray_daemons to >>>>>>>>>>>>>>>>>> false >>>>>>>>>>>>>>>>>> to ignore these 2 daemons, the cluster is now HEALTH_OK >>>>>>>>>>>>>>>>>> but it >>>>>>>>>>>>>>>>>> doesn't help. 
Following a Red Hat KB entry, I tried to >>>>>>>>>>>>>>>>>> failover >>>>>>>>>>>>>>>>>> the mgr, stopped an restarted the upgrade but without any >>>>>>>>>>>>>>>>>> improvement. I have not seen anything in the logs, except >>>>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>>>> there is an INF entry every 10s about the destroyed OSD >>>>>>>>>>>>>>>>>> saying: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>>>>> (mgr.55376028) 14079 : cephadm [INF] osd.253 now down >>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>>>>> (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on >>>>>>>>>>>>>>>>>> dig-osd4 >>>>>>>>>>>>>>>>>> was already removed >>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>>>>> (mgr.55376028) 14081 : cephadm [INF] Successfully >>>>>>>>>>>>>>>>>> destroyed old >>>>>>>>>>>>>>>>>> osd.253 on dig-osd4; ready for replacement >>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>>>>> (mgr.55376028) 14082 : cephadm [INF] Zapping devices for >>>>>>>>>>>>>>>>>> osd.253 >>>>>>>>>>>>>>>>>> on dig-osd4 >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The message seems to be only for one of the 2 destroyed OSDs >>>>>>>>>>>>>>>>>> since I restarted the mgr. May this be the cause for the >>>>>>>>>>>>>>>>>> stucked >>>>>>>>>>>>>>>>>> upgrade? What can I do for fixing this? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks in advance for any hint. Best regards, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Michel >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>> _______________________________________________ >>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>> _______________________________________________ >>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>> > > > _______________________________________________ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io _______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
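For completeness, once the removal queue is cleared and the upgrade is moving again, the settings toggled while debugging this can be put back. A short sketch, assuming the stock defaults are wanted and that the cluster actually has an osd.all-available-devices service:

    # Return cephadm cluster logging to its default level
    ceph config set mgr mgr/cephadm/log_to_cluster_level info

    # Re-enable the stray daemon warning if it was switched off during debugging
    ceph config set mgr mgr/cephadm/warn_on_stray_daemons true

    # Keep cephadm from re-creating OSDs on freshly zapped disks until the
    # replacement drive is actually in place
    ceph orch apply osd --all-available-devices --unmanaged=true

    # Confirm the upgrade is progressing
    ceph orch upgrade status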