I’ve had a similar experience with Reef, trying to destroy an improperly deployed OSD on a viable drive. I had to run `ceph-volume lvm zap` to get past the purge, and at no point was the OSD marked as destroyed.
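For anyone hitting the same "done, waiting for purge" state, here is a minimal sketch of the steps discussed further down in this thread. The OSD ID (253), hostname (dig-osd4) and device path (/dev/sdX) are only the examples from this thread and need to be adapted to your cluster:

    # Take the OSD out of the cephadm removal queue so it stops retrying the zap
    ceph orch osd rm stop 253

    # Zap the old LVs by hand, either directly on the OSD host...
    cephadm ceph-volume lvm zap --destroy /dev/sdX
    # ...or through the orchestrator
    ceph orch device zap dig-osd4 /dev/sdX --force

    # Then either keep the OSD ID reserved for the replacement drive...
    ceph osd destroy 253 --yes-i-really-mean-it
    # ...or remove the OSD from the CRUSH map entirely
    ceph osd purge 253 --yes-i-really-mean-it

Whether destroy (which preserves the ID for the replacement) or purge (which removes the OSD completely) is the right last step depends on whether the slot will be reused.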
> On Apr 30, 2025, at 6:15 AM, Eugen Block <ebl...@nde.ag> wrote:
>
> Right, 'ceph osd destroy' most likely won't help you here. The --replace
> flag is only there to mark an OSD as destroyed (so it will reuse its ID
> after replacing the drive).
> You wrote that stopping osd rm for 253 unblocked the upgrade, so the
> cluster is currently upgrading?
>
> To clear the pending state, I would stop rm for the other OSD as well,
> since it's already out and down anyway. You can always zap a drive, either
> directly on the host with:
>
> cephadm ceph-volume lvm zap --destroy /dev/sdX
>
> Or using the orchestrator:
>
> orch device zap <hostname> <path> [--force]
>
> But just to clarify, OSD.381 is already the replacement disk for a
> previously failed drive? If you zap it, the orchestrator would try to
> apply any matching spec and create a new OSD, probably with ID 381 again.
>
> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>
>> Frédéric,
>>
>> My situation is a bit different, I think. I had two malfunctioning OSDs
>> that I removed with `ceph orch osd rm --replace --zap`: one was really
>> dead and no longer seen by the OS (osd.253), and the other one had a lot
>> of HW errors but was still there (osd.381). Both have been successfully
>> marked as destroyed in the CRUSH map. I just didn't realize that cephadm
>> was retrying every 10s to zap osd.253, getting an error as the disk could
>> not be found. Looking at the removal status this morning (the removal was
>> done ~2 weeks ago) with 'ceph orch osd rm status', I got:
>>
>> OSD  HOST      STATE                    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
>> 253  dig-osd4  done, waiting for purge  0    True     False  True
>> 381  dig-osd6  done, waiting for purge  0    True     False  False  2025-04-23 11:56:09.864724+00:00
>>
>> I don't know what the status "waiting for purge" means... but we can see
>> that cephadm considers that the drain never started for osd.253, as the
>> device was unavailable I guess... What happens with dig-osd6 is less
>> clear to me, but it may be a consequence of the disk freed by the initial
>> rm being picked up by cephadm and re-added as the replacement OSD,
>> because we forgot to set the osd.all-available-devices service to
>> unmanaged. The drain started on Apr 23 is the second rm I did after
>> fixing the osd.all-available-devices service. For this second attempt, I
>> didn't specify --zap, not sure why (a mistake!).
>>
>> I have the feeling, but I may be wrong, that 'ceph osd destroy' will not
>> help as they are already marked destroyed in the CRUSH map...
>>
>> I'm wondering whether I should do 'ceph orch osd rm stop 381' as I did
>> for 253, or whether it will impact the replacement later. Or, said in a
>> different way, is the replace flag something managed by cephadm that
>> requires the OSD to stay in the "rm queue" until the replacement is done?
>>
>> Best regards,
>>
>> Michel
>>
>>> Le 30/04/2025 à 10:50, Frédéric Nass a écrit :
>>> Hi Michel,
>>>
>>> I've seen this recently on Reef (OSD stuck in the rm queue with the
>>> orchestrator trying to zap a device that had already been zapped).
>>>
>>> I could reproduce this a few times by deleting a batch of OSDs running
>>> on the same node. The whole 'ceph orch osd rm' process would stop
>>> progressing when trying to remove the ~8th OSD. I suspect that
>>> ceph-volume or the orchestrator is misinformed at some point that the
>>> device has already been zapped, looping over and over trying to remove
>>> this device that doesn't exist anymore.
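When the orchestrator is stuck in this kind of retry loop, the queue it is working from can be inspected and, if need be, the offending entry stopped. A short sketch using only commands already mentioned in this thread, with osd.253 as the example ID:

    # Show what cephadm still has queued, including the replace/force/zap flags
    ceph orch osd rm status

    # The same queue as raw JSON, straight from the mgr key/value store
    ceph config-key get mgr/cephadm/osd_remove_queue | jq

    # Take the entry out of the queue so the zap retries stop
    ceph orch osd rm stop 253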
>>>
>>> I think you should now run 'ceph osd destroy <OSD_ID>
>>> --yes-i-really-mean-it'.
>>>
>>> Regards,
>>> Frédéric.
>>>
>>> ----- Le 30 Avr 25, à 10:28, Michel Jouvin michel.jou...@ijclab.in2p3.fr
>>> a écrit :
>>>
>>>> Eugen,
>>>>
>>>> Thanks, I forgot that operations started with the orchestrator can be
>>>> stopped. You were right: stopping the 'osd rm' was enough to unblock
>>>> the upgrade. I am not completely sure what the consequence is for the
>>>> replace flag: I have the feeling it has been lost somehow, as the OSD
>>>> is no longer listed by 'ceph orch osd rm status' and 'ceph -s' now
>>>> reports one OSD down and 1 stray daemon instead of 2 stray daemons.
>>>>
>>>> Michel
>>>>
>>>> Le 30/04/2025 à 09:24, Eugen Block a écrit :
>>>>> You can stop the osd removal:
>>>>>
>>>>> ceph orch osd rm stop <OSD_ID>
>>>>>
>>>>> I'm not entirely sure what the orchestrator will do except for
>>>>> clearing the pending state, and since the OSDs are already marked as
>>>>> destroyed in the crush tree, I wouldn't expect anything weird. But
>>>>> it's worth a try, I guess.
>>>>>
>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I had no time to really investigate our problem further yesterday.
>>>>>> But I realized one issue that may explain the problem with osd.253:
>>>>>> the underlying disk is so dead that it is no longer visible to the
>>>>>> OS. I probably added --zap when I did the 'ceph orch osd rm', and
>>>>>> thus it is trying to do the zapping, fails as it doesn't find the
>>>>>> disk, and retries indefinitely... I remain a little surprised that
>>>>>> this zapping error is not reported (without the traceback) at the
>>>>>> INFO level and requires DEBUG to be seen, but that is a detail. I'm
>>>>>> surprised that Ceph is not giving up on zapping if it cannot access
>>>>>> the device; or did I miss something and there is a way to stop this
>>>>>> process?
>>>>>>
>>>>>> Maybe it is a corner case that has been fixed/improved since
>>>>>> 18.2.2... Anyway, the question remains: is there a way out of this
>>>>>> problem (which seems to be the only reason the upgrade is not really
>>>>>> starting) apart from getting the replacement device?
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Michel
>>>>>>
>>>>>> Le 28/04/2025 à 18:19, Michel Jouvin a écrit :
>>>>>>> Hi Frédéric,
>>>>>>>
>>>>>>> Thanks for the command. I'm always looking at the wrong page of the
>>>>>>> doc! I looked at
>>>>>>> https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/
>>>>>>> which lists the Ceph subsystems and their default log levels, but
>>>>>>> there is no mention of cephadm there... After enabling the cephadm
>>>>>>> debug log level and restarting the upgrade, I got the messages
>>>>>>> below. The only strange thing points to the problem with osd.253,
>>>>>>> where it tries to zap the device that was probably already zapped
>>>>>>> and thus cannot find the LV associated with osd.253. There aren't
>>>>>>> really any other messages indicating the impact on the upgrade, but
>>>>>>> I guess it is the reason. What do you think? And is there any way to
>>>>>>> fix it, other than replacing the OSD?
>>>>>>> >>>>>>> Best regards, >>>>>>> >>>>>>> Michel >>>>>>> >>>>>>> --------------------- cephadm debug level log ------------------------- >>>>>>> >>>>>>> 2025-04-28T17:32:12.713746+0200 mgr.dig-mon1.fownxo [INF] Upgrade: >>>>>>> Started with target quay.io/ceph/ceph:v18.2.6 >>>>>>> 2025-04-28T17:32:14.822030+0200 mgr.dig-mon1.fownxo [DBG] Refreshed >>>>>>> host dig-osd4 devices (23) >>>>>>> 2025-04-28T17:32:14.822550+0200 mgr.dig-mon1.fownxo [DBG] Finding >>>>>>> OSDSpecs for host: <dig-osd4> >>>>>>> 2025-04-28T17:32:14.822614+0200 mgr.dig-mon1.fownxo [DBG] Generating >>>>>>> OSDSpec previews for [] >>>>>>> 2025-04-28T17:32:14.822695+0200 mgr.dig-mon1.fownxo [DBG] Loading >>>>>>> OSDSpec previews to HostCache for host <dig-osd4> >>>>>>> 2025-04-28T17:32:14.985257+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'config generate-minimal-conf' -> 0 in 0.005s >>>>>>> 2025-04-28T17:32:15.262102+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'auth get' -> 0 in 0.277s >>>>>>> 2025-04-28T17:32:15.262751+0200 mgr.dig-mon1.fownxo [DBG] Combine >>>>>>> hosts with existing daemons [] + new hosts.... (very long line) >>>>>>> >>>>>>> 2025-04-28T17:32:15.416491+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> _update_paused_health >>>>>>> 2025-04-28T17:32:17.314607+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd df' -> 0 in 0.064s >>>>>>> 2025-04-28T17:32:17.637526+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd df' -> 0 in 0.320s >>>>>>> 2025-04-28T17:32:17.645703+0200 mgr.dig-mon1.fownxo [DBG] 2 OSDs are >>>>>>> scheduled for removal: [osd.381, osd.253] >>>>>>> 2025-04-28T17:32:17.661910+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd df' -> 0 in 0.011s >>>>>>> 2025-04-28T17:32:17.667068+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s >>>>>>> 2025-04-28T17:32:17.667117+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>>>>> safe-to-destroy returns: >>>>>>> 2025-04-28T17:32:17.667164+0200 mgr.dig-mon1.fownxo [DBG] running >>>>>>> cmd: osd down on ids [osd.381] >>>>>>> 2025-04-28T17:32:17.667854+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd down' -> 0 in 0.001s >>>>>>> 2025-04-28T17:32:17.667908+0200 mgr.dig-mon1.fownxo [INF] osd.381 >>>>>>> now down >>>>>>> 2025-04-28T17:32:17.668446+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>> osd.381 on dig-osd6 was already removed >>>>>>> 2025-04-28T17:32:17.669534+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s >>>>>>> 2025-04-28T17:32:17.669675+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>>>>> destroy-actual returns: >>>>>>> 2025-04-28T17:32:17.669789+0200 mgr.dig-mon1.fownxo [INF] >>>>>>> Successfully destroyed old osd.381 on dig-osd6; ready for replacement >>>>>>> 2025-04-28T17:32:17.669874+0200 mgr.dig-mon1.fownxo [DBG] Removing >>>>>>> osd.381 from the queue. 
>>>>>>> 2025-04-28T17:32:17.680411+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd df' -> 0 in 0.010s >>>>>>> 2025-04-28T17:32:17.685141+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s >>>>>>> 2025-04-28T17:32:17.685190+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>>>>> safe-to-destroy returns: >>>>>>> 2025-04-28T17:32:17.685234+0200 mgr.dig-mon1.fownxo [DBG] running >>>>>>> cmd: osd down on ids [osd.253] >>>>>>> 2025-04-28T17:32:17.685710+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd down' -> 0 in 0.000s >>>>>>> 2025-04-28T17:32:17.685759+0200 mgr.dig-mon1.fownxo [INF] osd.253 >>>>>>> now down >>>>>>> 2025-04-28T17:32:17.686186+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>> osd.253 on dig-osd4 was already removed >>>>>>> 2025-04-28T17:32:17.687068+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s >>>>>>> 2025-04-28T17:32:17.687102+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>>>>> destroy-actual returns: >>>>>>> 2025-04-28T17:32:17.687141+0200 mgr.dig-mon1.fownxo [INF] >>>>>>> Successfully destroyed old osd.253 on dig-osd4; ready for replacement >>>>>>> 2025-04-28T17:32:17.687176+0200 mgr.dig-mon1.fownxo [INF] Zapping >>>>>>> devices for osd.253 on dig-osd4 >>>>>>> 2025-04-28T17:32:17.687508+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> _run_cephadm : command = ceph-volume >>>>>>> 2025-04-28T17:32:17.687554+0200 mgr.dig-mon1.fownxo [DBG] >>>>>>> _run_cephadm : args = ['--', 'lvm', 'zap', '--osd-id', '253', >>>>>>> '--destroy'] >>>>>>> 2025-04-28T17:32:17.687637+0200 mgr.dig-mon1.fownxo [DBG] osd >>>>>>> container image >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> 2025-04-28T17:32:17.687677+0200 mgr.dig-mon1.fownxo [DBG] args: >>>>>>> --image >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> --timeout 895 ceph-volume --fsid >>>>>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy >>>>>>> 2025-04-28T17:32:17.687733+0200 mgr.dig-mon1.fownxo [DBG] Running >>>>>>> command: which python3 >>>>>>> 2025-04-28T17:32:17.731474+0200 mgr.dig-mon1.fownxo [DBG] Running >>>>>>> command: /usr/bin/python3 >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d >>>>>>> --image >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> --timeout 895 ceph-volume --fsid >>>>>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy >>>>>>> 2025-04-28T17:32:20.406723+0200 mgr.dig-mon1.fownxo [DBG] code: 1 >>>>>>> 2025-04-28T17:32:20.406764+0200 mgr.dig-mon1.fownxo [DBG] err: >>>>>>> Inferring config >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/config/ceph.conf >>>>>>> Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host >>>>>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume >>>>>>> --privileged --group-add=disk --init -e >>>>>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e >>>>>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v >>>>>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z >>>>>>> -v >>>>>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z >>>>>>> -v >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z 
>>>>>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v >>>>>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v >>>>>>> /run/lock/lvm:/run/lock/lvm -v >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro >>>>>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v >>>>>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> lvm zap --osd-id 253 --destroy >>>>>>> /usr/bin/podman: stderr Traceback (most recent call last): >>>>>>> /usr/bin/podman: stderr File "/usr/sbin/ceph-volume", line 11, in >>>>>>> <module> >>>>>>> /usr/bin/podman: stderr load_entry_point('ceph-volume==1.0.0', >>>>>>> 'console_scripts', 'ceph-volume')() >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in >>>>>>> __init__ >>>>>>> /usr/bin/podman: stderr self.main(self.argv) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line >>>>>>> 59, in newfunc >>>>>>> /usr/bin/podman: stderr return f(*a, **kw) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in >>>>>>> main >>>>>>> /usr/bin/podman: stderr terminal.dispatch(self.mapper, >>>>>>> subcommand_args) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line >>>>>>> 194, in dispatch >>>>>>> /usr/bin/podman: stderr instance.main() >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", >>>>>>> line 46, in main >>>>>>> /usr/bin/podman: stderr terminal.dispatch(self.mapper, self.argv) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line >>>>>>> 194, in dispatch >>>>>>> /usr/bin/podman: stderr instance.main() >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>>>>> line 403, in main >>>>>>> /usr/bin/podman: stderr self.zap_osd() >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line >>>>>>> 16, in is_root >>>>>>> /usr/bin/podman: stderr return func(*a, **kw) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>>>>> line 301, in zap_osd >>>>>>> /usr/bin/podman: stderr devices = >>>>>>> find_associated_devices(self.args.osd_id, self.args.osd_fsid) >>>>>>> /usr/bin/podman: stderr File >>>>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>>>>> line 88, in find_associated_devices >>>>>>> /usr/bin/podman: stderr '%s' % osd_id or osd_fsid) >>>>>>> /usr/bin/podman: stderr RuntimeError: Unable to find any LV for >>>>>>> zapping OSD: 253 >>>>>>> Traceback (most recent call last): >>>>>>> File "/usr/lib64/python3.9/runpy.py", line 197, in >>>>>>> _run_module_as_main >>>>>>> return _run_code(code, main_globals, None, >>>>>>> File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code >>>>>>> exec(code, run_globals) >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 10700, in <module> >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 10688, in main >>>>>>> File 
>>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2445, in _infer_config >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2361, in _infer_fsid >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2473, in _infer_image >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2348, in _validate_fsid >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 6970, in command_ceph_volume >>>>>>> File >>>>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>>>>> line 2136, in call_throws >>>>>>> RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host >>>>>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume >>>>>>> --privileged --group-add=disk --init -e >>>>>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e >>>>>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v >>>>>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z >>>>>>> -v >>>>>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z >>>>>>> -v >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z >>>>>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v >>>>>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v >>>>>>> /run/lock/lvm:/run/lock/lvm -v >>>>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro >>>>>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v >>>>>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z >>>>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>>>>> lvm zap --osd-id 253 --destroy >>>>>>> 2025-04-28T17:32:20.409316+0200 mgr.dig-mon1.fownxo [DBG] serve loop >>>>>>> sleep >>>>>>> >>>>>>> ----------------------- >>>>>>> >>>>>>> >>>>>>> Le 28/04/2025 à 14:00, Frédéric Nass a écrit : >>>>>>>> Hi Michel, >>>>>>>> >>>>>>>> You need to turn on cephadm debugging as described here [1] in the >>>>>>>> documentation >>>>>>>> >>>>>>>> $ ceph config set mgr mgr/cephadm/log_to_cluster_level debug >>>>>>>> >>>>>>>> and then look for any hints with >>>>>>>> >>>>>>>> $ ceph -W cephadm --watch-debug >>>>>>>> >>>>>>>> or >>>>>>>> >>>>>>>> $ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log (on the >>>>>>>> active MGR) >>>>>>>> >>>>>>>> when you start/stop the upgrade. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Frédéric. >>>>>>>> >>>>>>>> [1] https://docs.ceph.com/en/reef/cephadm/operations/ >>>>>>>> >>>>>>>> ----- Le 28 Avr 25, à 12:52, Michel Jouvin >>>>>>>> michel.jou...@ijclab.in2p3.fr a écrit : >>>>>>>> >>>>>>>>> Eugen, >>>>>>>>> >>>>>>>>> Thanks for doing the test. 
I scanned all logs and cannot find >>>>>>>>> anything >>>>>>>>> except the message mentioned displayed every 10s about the removed >>>>>>>>> OSDs >>>>>>>>> that led me to think there is something not exactly as expected... >>>>>>>>> No clue >>>>>>>>> what... >>>>>>>>> >>>>>>>>> Michel >>>>>>>>> Sent from my mobile >>>>>>>>> Le 28 avril 2025 12:43:23 Eugen Block <ebl...@nde.ag> a écrit : >>>>>>>>> >>>>>>>>>> I just tried this on a single-node virtual test cluster, deployed it >>>>>>>>>> with 18.2.2. Then I removed one OSD with --replace flag (no --zap, >>>>>>>>>> otherwise it would redeploy the OSD on that VM). Then I also see the >>>>>>>>>> stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished >>>>>>>>>> successfully. That's why I don't think the stray daemon is the root >>>>>>>>>> cause here. I would suggest to scan monitor and cephadm logs as >>>>>>>>>> well. >>>>>>>>>> After the upgrade to 18.2.6 the stray warning cleared, btw. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>> >>>>>>>>>>> Eugen, >>>>>>>>>>> >>>>>>>>>>> As said in a previous message, I found a tracker issue with a >>>>>>>>>>> similar problem: https://tracker.ceph.com/issues/67018, even if the >>>>>>>>>>> cause may be different as it is in older versions than me. For some >>>>>>>>>>> reasons the sequence of messages every 10s is now back on the 2 >>>>>>>>>>> OSDs: >>>>>>>>>>> >>>>>>>>>>> 2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>>>>> osd.253 now down >>>>>>>>>>> 2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>>>>>> osd.253 on dig-osd4 was already removed >>>>>>>>>>> 2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>>>>> Successfully destroyed old osd.253 on dig-osd4; ready for >>>>>>>>>>> replacement >>>>>>>>>>> 2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping >>>>>>>>>>> devices for osd.253 on dig-osd4 >>>>>>>>>>> 2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>>>>> osd.381 now down >>>>>>>>>>> 2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>>>>>> osd.381 on dig-osd6 was already removed >>>>>>>>>>> 2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>>>>> Successfully destroyed old osd.381 on dig-osd6; ready for >>>>>>>>>>> replacement >>>>>>>>>>> >>>>>>>>>>> except that the "Zapping.." message is not present for the >>>>>>>>>>> second OSD... >>>>>>>>>>> >>>>>>>>>>> I tried to increase the mgr log verbosity with 'ceph tell >>>>>>>>>>> mgr.dig-mon1.fownxo config set debug_mgr 20/20' and there >>>>>>>>>>> stop/start >>>>>>>>>>> the upgrade without any additonal message displayed. >>>>>>>>>>> >>>>>>>>>>> Michel >>>>>>>>>>> >>>>>>>>>>> Le 28/04/2025 à 09:20, Eugen Block a écrit : >>>>>>>>>>>> Have you increased the debug level for the mgr? It would surprise >>>>>>>>>>>> me if stray daemons would really block an upgrade. But debug logs >>>>>>>>>>>> might reveal something. And if it can be confirmed that the strays >>>>>>>>>>>> really block the upgrade, you could either remove the OSDs >>>>>>>>>>>> entirely >>>>>>>>>>>> (they are already drained) to continue upgrading, or create a >>>>>>>>>>>> tracker issue to report this and wait for instructions. >>>>>>>>>>>> >>>>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Eugen, >>>>>>>>>>>>> >>>>>>>>>>>>> Yes I stopped and restarted the upgrade several times already, in >>>>>>>>>>>>> particular after failing over the mgr. 
And the only messages >>>>>>>>>>>>> related are the upgrade started and upgrade canceled ones. >>>>>>>>>>>>> Nothing >>>>>>>>>>>>> related to an error or a crash... >>>>>>>>>>>>> >>>>>>>>>>>>> For me the question is why do I have stray daemons after removing >>>>>>>>>>>>> OSD. IMO it is unexpected as these daemons are not there anymore. >>>>>>>>>>>>> I can understand that stray daemons prevent the upgrade to start >>>>>>>>>>>>> if they are really strayed... And it would be nice if cephadm was >>>>>>>>>>>>> giving a message about why the upgrade does not really start >>>>>>>>>>>>> despite its status is "in progress"... >>>>>>>>>>>>> >>>>>>>>>>>>> Best regards, >>>>>>>>>>>>> >>>>>>>>>>>>> Michel >>>>>>>>>>>>> Sent from my mobile >>>>>>>>>>>>> Le 28 avril 2025 07:27:44 Eugen Block <ebl...@nde.ag> a écrit : >>>>>>>>>>>>> >>>>>>>>>>>>>> Do you see anything in the mgr log? To get fresh logs I would >>>>>>>>>>>>>> cancel >>>>>>>>>>>>>> the upgrade (ceph orch upgrade stop) and then try again. >>>>>>>>>>>>>> A workaround could be to manually upgrade the mgr daemons by >>>>>>>>>>>>>> changing >>>>>>>>>>>>>> their unit.run file, but that would be my last resort. Btwm >>>>>>>>>>>>>> did you >>>>>>>>>>>>>> stop and start the upgrade after failing the mgr as well? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Eugen, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for the hint. Here is the osd_remove_queue: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph config-key get >>>>>>>>>>>>>>> mgr/cephadm/osd_remove_queue|jq >>>>>>>>>>>>>>> [ >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> "osd_id": 253, >>>>>>>>>>>>>>> "started": true, >>>>>>>>>>>>>>> "draining": false, >>>>>>>>>>>>>>> "stopped": false, >>>>>>>>>>>>>>> "replace": true, >>>>>>>>>>>>>>> "force": false, >>>>>>>>>>>>>>> "zap": true, >>>>>>>>>>>>>>> "hostname": "dig-osd4", >>>>>>>>>>>>>>> "drain_started_at": null, >>>>>>>>>>>>>>> "drain_stopped_at": null, >>>>>>>>>>>>>>> "drain_done_at": "2025-04-15T14:09:30.521534Z", >>>>>>>>>>>>>>> "process_started_at": "2025-04-15T14:09:14.091592Z" >>>>>>>>>>>>>>> }, >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> "osd_id": 381, >>>>>>>>>>>>>>> "started": true, >>>>>>>>>>>>>>> "draining": false, >>>>>>>>>>>>>>> "stopped": false, >>>>>>>>>>>>>>> "replace": true, >>>>>>>>>>>>>>> "force": false, >>>>>>>>>>>>>>> "zap": false, >>>>>>>>>>>>>>> "hostname": "dig-osd6", >>>>>>>>>>>>>>> "drain_started_at": "2025-04-23T11:56:09.864724Z", >>>>>>>>>>>>>>> "drain_stopped_at": null, >>>>>>>>>>>>>>> "drain_done_at": "2025-04-25T06:53:03.678729Z", >>>>>>>>>>>>>>> "process_started_at": "2025-04-23T11:56:05.924923Z" >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> ] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It is not empty the two stray daemons are listed. Not sure >>>>>>>>>>>>>>> it these >>>>>>>>>>>>>>> entries are expected as I specified --replace... A similar >>>>>>>>>>>>>>> issue was >>>>>>>>>>>>>>> reported in https://tracker.ceph.com/issues/67018 so before >>>>>>>>>>>>>>> Reef but >>>>>>>>>>>>>>> the cause may be different. Still not clear for me how to >>>>>>>>>>>>>>> get out of >>>>>>>>>>>>>>> this, except may be replacing the OSDs but this will take >>>>>>>>>>>>>>> some time... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Michel >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Le 27/04/2025 à 10:21, Eugen Block a écrit : >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> what's the current ceph status? 
Wasn't there a bug in early >>>>>>>>>>>>>>>> Reef >>>>>>>>>>>>>>>> versions preventing upgrades if there were removed OSDs in the >>>>>>>>>>>>>>>> queue? But IIRC, the cephadm module would crash. Can you check >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ceph config-key get mgr/cephadm/osd_remove_queue >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> And then I would check the mgr log, maybe set it to a >>>>>>>>>>>>>>>> higher debug >>>>>>>>>>>>>>>> level to see what's blocking it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I tried to restart all the mgrs (we have 3, 1 active, 2 >>>>>>>>>>>>>>>>> standby) >>>>>>>>>>>>>>>>> by executing 3 times the `ceph mgr fail`, no impact. I don't >>>>>>>>>>>>>>>>> really understand why I get these stray daemons after doing a >>>>>>>>>>>>>>>>> 'ceph orch osd rm --replace` but I think I have always >>>>>>>>>>>>>>>>> seen this. >>>>>>>>>>>>>>>>> I tried to mute rather than disable the stray daemon check >>>>>>>>>>>>>>>>> but it >>>>>>>>>>>>>>>>> doesn't help either. And I find strange this message every >>>>>>>>>>>>>>>>> 10s >>>>>>>>>>>>>>>>> about one of the destroyed OSD and only one, reporting it >>>>>>>>>>>>>>>>> is down >>>>>>>>>>>>>>>>> and already destroyed and saying it'll zap it (I think I >>>>>>>>>>>>>>>>> didn't >>>>>>>>>>>>>>>>> add --zap when I removed it as the underlying disk is dead). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm completely stuck with this upgrade and I don't >>>>>>>>>>>>>>>>> remember having >>>>>>>>>>>>>>>>> this kind of problems in previous upgrades with cephadm... >>>>>>>>>>>>>>>>> Any >>>>>>>>>>>>>>>>> idea where to look for the cause and/or how to fix it? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Michel >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Le 24/04/2025 à 23:34, Michel Jouvin a écrit : >>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm trying to upgrade a (cephadm) cluster from 18.2.2 to >>>>>>>>>>>>>>>>>> 18.2.6, >>>>>>>>>>>>>>>>>> using 'ceph orch upgrade'. When I enter the command 'ceph >>>>>>>>>>>>>>>>>> orch >>>>>>>>>>>>>>>>>> upgrade start --ceph-version 18.2.6', I receive a message >>>>>>>>>>>>>>>>>> saying >>>>>>>>>>>>>>>>>> that the upgrade has been initiated, with a similar >>>>>>>>>>>>>>>>>> message in >>>>>>>>>>>>>>>>>> the logs but nothing happens after this. 'ceph orch upgrade >>>>>>>>>>>>>>>>>> status' says: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph orch upgrade status >>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>> "target_image": "quay.io/ceph/ceph:v18.2.6", >>>>>>>>>>>>>>>>>> "in_progress": true, >>>>>>>>>>>>>>>>>> "which": "Upgrading all daemon types on all hosts", >>>>>>>>>>>>>>>>>> "services_complete": [], >>>>>>>>>>>>>>>>>> "progress": "", >>>>>>>>>>>>>>>>>> "message": "", >>>>>>>>>>>>>>>>>> "is_paused": false >>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>> ------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The first time I entered the command, the cluster status was >>>>>>>>>>>>>>>>>> HEALTH_WARN because of 2 stray daemons (caused by >>>>>>>>>>>>>>>>>> destroyed OSDs, >>>>>>>>>>>>>>>>>> rm --replace). I set mgr/cephadm/warn_on_stray_daemons to >>>>>>>>>>>>>>>>>> false >>>>>>>>>>>>>>>>>> to ignore these 2 daemons, the cluster is now HEALTH_OK >>>>>>>>>>>>>>>>>> but it >>>>>>>>>>>>>>>>>> doesn't help. 
Following a Red Hat KB entry, I tried to >>>>>>>>>>>>>>>>>> failover >>>>>>>>>>>>>>>>>> the mgr, stopped an restarted the upgrade but without any >>>>>>>>>>>>>>>>>> improvement. I have not seen anything in the logs, except >>>>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>>>> there is an INF entry every 10s about the destroyed OSD >>>>>>>>>>>>>>>>>> saying: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>>>>> (mgr.55376028) 14079 : cephadm [INF] osd.253 now down >>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>>>>> (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on >>>>>>>>>>>>>>>>>> dig-osd4 >>>>>>>>>>>>>>>>>> was already removed >>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>>>>> (mgr.55376028) 14081 : cephadm [INF] Successfully >>>>>>>>>>>>>>>>>> destroyed old >>>>>>>>>>>>>>>>>> osd.253 on dig-osd4; ready for replacement >>>>>>>>>>>>>>>>>> 2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>>>>> (mgr.55376028) 14082 : cephadm [INF] Zapping devices for >>>>>>>>>>>>>>>>>> osd.253 >>>>>>>>>>>>>>>>>> on dig-osd4 >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The message seems to be only for one of the 2 destroyed OSDs >>>>>>>>>>>>>>>>>> since I restarted the mgr. May this be the cause for the >>>>>>>>>>>>>>>>>> stucked >>>>>>>>>>>>>>>>>> upgrade? What can I do for fixing this? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks in advance for any hint. Best regards, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Michel >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>>> _______________________________________________ >>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>>>>>> _______________________________________________ >>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io >>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io >>>>> > > > _______________________________________________ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io _______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
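For completeness, once the removal queue is cleared and the upgrade is moving again, the settings toggled while debugging this can be put back. A short sketch, assuming the stock defaults are wanted and that the cluster actually has an osd.all-available-devices service:

    # Return cephadm cluster logging to its default level
    ceph config set mgr mgr/cephadm/log_to_cluster_level info

    # Re-enable the stray daemon warning if it was switched off during debugging
    ceph config set mgr mgr/cephadm/warn_on_stray_daemons true

    # Keep cephadm from re-creating OSDs on freshly zapped disks until the
    # replacement drive is actually in place
    ceph orch apply osd --all-available-devices --unmanaged=true

    # Confirm the upgrade is progressing
    ceph orch upgrade status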