Hi Michel,

You need to turn on cephadm debugging as described in the documentation [1]:

$ ceph config set mgr mgr/cephadm/log_to_cluster_level debug

and then look for any hints with

$ ceph -W cephadm --watch-debug

or

$ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log

(on the active MGR) when you start/stop the upgrade.
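Once you are done, remember to turn the verbosity back down; something like the following should restore the default (assuming the default level is "info" -- otherwise 'ceph config rm mgr mgr/cephadm/log_to_cluster_level' drops the override entirely):

$ ceph config set mgr mgr/cephadm/log_to_cluster_level info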
Regards,
Frédéric.

[1] https://docs.ceph.com/en/reef/cephadm/operations/

----- On 28 Apr 25, at 12:52, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

> Eugen,
>
> Thanks for doing the test. I scanned all the logs and cannot find anything except the message mentioned before, displayed every 10s about the removed OSDs, which leads me to think something is not exactly as expected... No clue what...
>
> Michel
> Sent from my mobile
>
> On 28 April 2025 12:43:23 Eugen Block <ebl...@nde.ag> wrote:
>
>> I just tried this on a single-node virtual test cluster, deployed with 18.2.2. Then I removed one OSD with the --replace flag (no --zap, otherwise it would have redeployed the OSD on that VM). I also saw the stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished successfully. That's why I don't think the stray daemon is the root cause here. I would suggest scanning the monitor and cephadm logs as well. After the upgrade to 18.2.6 the stray warning cleared, btw.
>>
>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>
>>> Eugen,
>>>
>>> As said in a previous message, I found a tracker issue with a similar problem: https://tracker.ceph.com/issues/67018, even if the cause may be different, as it affects older versions than mine. For some reason, the sequence of messages every 10s is now back for the 2 OSDs:
>>>
>>> 2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF] osd.253 now down
>>> 2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon osd.253 on dig-osd4 was already removed
>>> 2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF] Successfully destroyed old osd.253 on dig-osd4; ready for replacement
>>> 2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping devices for osd.253 on dig-osd4
>>> 2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF] osd.381 now down
>>> 2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon osd.381 on dig-osd6 was already removed
>>> 2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF] Successfully destroyed old osd.381 on dig-osd6; ready for replacement
>>>
>>> except that the "Zapping..." message is not present for the second OSD...
>>>
>>> I tried to increase the mgr log verbosity with 'ceph tell mgr.dig-mon1.fownxo config set debug_mgr 20/20' and then stopped/started the upgrade, without any additional message being displayed.
>>>
>>> Michel
>>>
>>> On 28/04/2025 at 09:20, Eugen Block wrote:
>>>> Have you increased the debug level for the mgr? It would surprise me if stray daemons really blocked an upgrade, but debug logs might reveal something. And if it can be confirmed that the strays really block the upgrade, you could either remove the OSDs entirely (they are already drained) to continue upgrading, or create a tracker issue to report this and wait for instructions.
>>>>
>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>
>>>>> Hi Eugen,
>>>>>
>>>>> Yes, I stopped and restarted the upgrade several times already, in particular after failing over the mgr. And the only related messages are the "upgrade started" and "upgrade canceled" ones. Nothing related to an error or a crash...
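>>>>> For reference, the sequence I used each time was roughly this (from memory, so take the exact order with a grain of salt):
>>>>>
>>>>> $ ceph orch upgrade stop
>>>>> $ ceph mgr fail
>>>>> $ ceph orch upgrade start --ceph-version 18.2.6
>>>>>
>>>>> checking 'ceph orch upgrade status' in between.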
>>>>>
>>>>> For me the question is why I have stray daemons at all after removing the OSDs. IMO it is unexpected, as these daemons are not there anymore. I could understand stray daemons preventing the upgrade from starting if they were really stray... And it would be nice if cephadm gave a message about why the upgrade does not actually start even though its status is "in progress"...
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Michel
>>>>> Sent from my mobile
>>>>>
>>>>> On 28 April 2025 07:27:44 Eugen Block <ebl...@nde.ag> wrote:
>>>>>
>>>>>> Do you see anything in the mgr log? To get fresh logs I would cancel the upgrade (ceph orch upgrade stop) and then try again. A workaround could be to manually upgrade the mgr daemons by changing their unit.run file, but that would be my last resort. Btw, did you stop and start the upgrade after failing the mgr as well?
>>>>>>
>>>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>>
>>>>>>> Eugen,
>>>>>>>
>>>>>>> Thanks for the hint. Here is the osd_remove_queue:
>>>>>>>
>>>>>>> [root@ijc-mon1 ~]# ceph config-key get mgr/cephadm/osd_remove_queue | jq
>>>>>>> [
>>>>>>>   {
>>>>>>>     "osd_id": 253,
>>>>>>>     "started": true,
>>>>>>>     "draining": false,
>>>>>>>     "stopped": false,
>>>>>>>     "replace": true,
>>>>>>>     "force": false,
>>>>>>>     "zap": true,
>>>>>>>     "hostname": "dig-osd4",
>>>>>>>     "drain_started_at": null,
>>>>>>>     "drain_stopped_at": null,
>>>>>>>     "drain_done_at": "2025-04-15T14:09:30.521534Z",
>>>>>>>     "process_started_at": "2025-04-15T14:09:14.091592Z"
>>>>>>>   },
>>>>>>>   {
>>>>>>>     "osd_id": 381,
>>>>>>>     "started": true,
>>>>>>>     "draining": false,
>>>>>>>     "stopped": false,
>>>>>>>     "replace": true,
>>>>>>>     "force": false,
>>>>>>>     "zap": false,
>>>>>>>     "hostname": "dig-osd6",
>>>>>>>     "drain_started_at": "2025-04-23T11:56:09.864724Z",
>>>>>>>     "drain_stopped_at": null,
>>>>>>>     "drain_done_at": "2025-04-25T06:53:03.678729Z",
>>>>>>>     "process_started_at": "2025-04-23T11:56:05.924923Z"
>>>>>>>   }
>>>>>>> ]
>>>>>>>
>>>>>>> It is not empty; the two stray daemons are listed. Not sure if these entries are expected, as I specified --replace... A similar issue was reported in https://tracker.ceph.com/issues/67018, so before Reef, but the cause may be different. It is still not clear to me how to get out of this, except maybe replacing the OSDs, but that will take some time...
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Michel
>>>>>>>
>>>>>>> On 27/04/2025 at 10:21, Eugen Block wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> What's the current ceph status? Wasn't there a bug in early Reef versions preventing upgrades if there were removed OSDs in the queue? But IIRC, the cephadm module would crash. Can you check
>>>>>>>>
>>>>>>>> ceph config-key get mgr/cephadm/osd_remove_queue
>>>>>>>>
>>>>>>>> And then I would check the mgr log, maybe set it to a higher debug level to see what's blocking it.
>>>>>>>>
>>>>>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I tried to restart all the mgrs (we have 3: 1 active, 2 standby) by executing `ceph mgr fail` 3 times, with no impact. I don't really understand why I get these stray daemons after doing a 'ceph orch osd rm --replace', but I think I have always seen this. I tried to mute rather than disable the stray daemon check, but it doesn't help either.
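>>>>>>>>> The mute was something along these lines (the exact health code is from memory):
>>>>>>>>>
>>>>>>>>> $ ceph health mute CEPHADM_STRAY_DAEMON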
>>>>>>>>> And I find it strange that this message appears every 10s for one of the destroyed OSDs, and only one, reporting that it is down and already destroyed and saying it'll zap it (I think I didn't add --zap when I removed it, as the underlying disk is dead).
>>>>>>>>>
>>>>>>>>> I'm completely stuck with this upgrade, and I don't remember having problems like this in previous upgrades with cephadm... Any idea where to look for the cause and/or how to fix it?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Michel
>>>>>>>>>
>>>>>>>>> On 24/04/2025 at 23:34, Michel Jouvin wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6 using 'ceph orch upgrade'. When I enter the command 'ceph orch upgrade start --ceph-version 18.2.6', I receive a message saying that the upgrade has been initiated, with a similar message in the logs, but nothing happens after this. 'ceph orch upgrade status' says:
>>>>>>>>>>
>>>>>>>>>> -------
>>>>>>>>>> [root@ijc-mon1 ~]# ceph orch upgrade status
>>>>>>>>>> {
>>>>>>>>>>     "target_image": "quay.io/ceph/ceph:v18.2.6",
>>>>>>>>>>     "in_progress": true,
>>>>>>>>>>     "which": "Upgrading all daemon types on all hosts",
>>>>>>>>>>     "services_complete": [],
>>>>>>>>>>     "progress": "",
>>>>>>>>>>     "message": "",
>>>>>>>>>>     "is_paused": false
>>>>>>>>>> }
>>>>>>>>>> -------
>>>>>>>>>>
>>>>>>>>>> The first time I entered the command, the cluster status was HEALTH_WARN because of 2 stray daemons (caused by OSDs destroyed with rm --replace). I set mgr/cephadm/warn_on_stray_daemons to false to ignore these 2 daemons; the cluster is now HEALTH_OK, but it doesn't help. Following a Red Hat KB entry, I tried to fail over the mgr, then stopped and restarted the upgrade, but without any improvement. I have not seen anything in the logs, except an INF entry every 10s about the destroyed OSD saying:
>>>>>>>>>>
>>>>>>>>>> ------
>>>>>>>>>> 2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14079 : cephadm [INF] osd.253 now down
>>>>>>>>>> 2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on dig-osd4 was already removed
>>>>>>>>>> 2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14081 : cephadm [INF] Successfully destroyed old osd.253 on dig-osd4; ready for replacement
>>>>>>>>>> 2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14082 : cephadm [INF] Zapping devices for osd.253 on dig-osd4
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>> The messages seem to concern only one of the 2 destroyed OSDs since I restarted the mgr. Might this be the cause of the stuck upgrade? What can I do to fix this?
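>>>>>>>>>> For completeness, the commands involved were along these lines (osd.253 as an example; the exact rm invocation is from memory):
>>>>>>>>>>
>>>>>>>>>> $ ceph orch osd rm 253 --replace
>>>>>>>>>> $ ceph config set mgr mgr/cephadm/warn_on_stray_daemons false
>>>>>>>>>> $ ceph orch upgrade start --ceph-version 18.2.6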
>>>>>>>>>> Thanks in advance for any hint.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>>
>>>>>>>>>> Michel

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io