Hi Michel,

You need to turn on cephadm debugging, as described in the documentation [1]:

$ ceph config set mgr mgr/cephadm/log_to_cluster_level debug

and then look for any hints with

$ ceph -W cephadm --watch-debug

or

$ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log      (on the active MGR)

when you start/stop the upgrade.
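If you want to pre-filter the log while watching it, here is a minimal Python sketch; the keyword list is my assumption, not an official set of cephadm log markers, so extend it with whatever you actually see in your log:

```python
"""Filter cephadm log lines for upgrade-related hints.

Sketch only: KEYWORDS is an assumption, not an official list of
cephadm log markers. Pipe log lines through interesting(), e.g. on
lines read from ceph.cephadm.log on the active MGR.
"""

KEYWORDS = ("upgrade", "[err]", "traceback", "stray")

def interesting(line: str) -> bool:
    """True if the line mentions any keyword (case-insensitive)."""
    low = line.lower()
    return any(k in low for k in KEYWORDS)

sample = [
    "2025-04-28T10:00:28 mgr [INF] osd.253 now down",
    "2025-04-28T10:01:02 cephadm [INF] Upgrade: Started with target v18.2.6",
]
# Keeps only the Upgrade line from the sample above.
print([l for l in sample if interesting(l)])
```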

Regards,
Frédéric.

[1] https://docs.ceph.com/en/reef/cephadm/operations/

----- On 28 Apr 25, at 12:52, Michel Jouvin michel.jou...@ijclab.in2p3.fr
wrote:

> Eugen,
> 
> Thanks for doing the test. I scanned all the logs and cannot find anything
> except the message mentioned, displayed every 10s about the removed OSDs,
> which led me to think there is something not exactly as expected... No clue
> what...
> 
> Michel
> Sent from my mobile
> On 28 April 2025 at 12:43:23, Eugen Block <ebl...@nde.ag> wrote:
> 
>> I just tried this on a single-node virtual test cluster, deployed it
>> with 18.2.2. Then I removed one OSD with --replace flag (no --zap,
>> otherwise it would redeploy the OSD on that VM). Then I also see the
>> stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished
>> successfully. That's why I don't think the stray daemon is the root
>> cause here. I would suggest scanning the monitor and cephadm logs as well.
>> After the upgrade to 18.2.6 the stray warning cleared, btw.
>>
>>
>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>
>>> Eugen,
>>>
>>> As said in a previous message, I found a tracker issue with a
>>> similar problem: https://tracker.ceph.com/issues/67018, even if the
>>> cause may be different, as it affects older versions than mine. For some
>>> reason the sequence of messages every 10s is now back for the 2 OSDs:
>>>
>>> 2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF] osd.253 now down
>>> 2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>> osd.253 on dig-osd4 was already removed
>>> 2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF]
>>> Successfully destroyed old osd.253 on dig-osd4; ready for replacement
>>> 2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping
>>> devices for osd.253 on dig-osd4
>>> 2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF] osd.381 now down
>>> 2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>> osd.381 on dig-osd6 was already removed
>>> 2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF]
>>> Successfully destroyed old osd.381 on dig-osd6; ready for replacement
>>>
>>> except that the "Zapping.." message is not present for the second OSD...
>>>
>>> I tried to increase the mgr log verbosity with 'ceph tell
>>> mgr.dig-mon1.fownxo config set debug_mgr 20/20' and then stopped/started
>>> the upgrade, without any additional message displayed.
>>>
>>> Michel
>>>
>>>> On 28/04/2025 at 09:20, Eugen Block wrote:
>>>> Have you increased the debug level for the mgr? It would surprise
>>>> me if stray daemons really blocked an upgrade. But debug logs
>>>> might reveal something. And if it can be confirmed that the strays
>>>> really block the upgrade, you could either remove the OSDs entirely
>>>> (they are already drained) to continue upgrading, or create a
>>>> tracker issue to report this and wait for instructions.
>>>>
>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>
>>>>> Hi Eugen,
>>>>>
>>>>> Yes, I stopped and restarted the upgrade several times already, in
>>>>> particular after failing over the mgr. And the only related messages
>>>>> are the 'upgrade started' and 'upgrade canceled' ones. Nothing
>>>>> related to an error or a crash...
>>>>>
>>>>> For me the question is why I have stray daemons after removing the
>>>>> OSDs. IMO it is unexpected, as these daemons are not there anymore.
>>>>> I could understand stray daemons preventing the upgrade from starting
>>>>> if they were really stray... And it would be nice if cephadm gave a
>>>>> message about why the upgrade does not really start despite its
>>>>> status being 'in progress'...
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Michel
>>>>> Sent from my mobile
>>>>> On 28 April 2025 at 07:27:44, Eugen Block <ebl...@nde.ag> wrote:
>>>>>
>>>>>> Do you see anything in the mgr log? To get fresh logs I would cancel
>>>>>> the upgrade (ceph orch upgrade stop) and then try again.
>>>>>> A workaround could be to manually upgrade the mgr daemons by changing
>>>>>> their unit.run file, but that would be my last resort. Btw, did you
>>>>>> stop and start the upgrade after failing the mgr as well?
>>>>>>
>>>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>>
>>>>>>> Eugen,
>>>>>>>
>>>>>>> Thanks for the hint. Here is the osd_remove_queue:
>>>>>>>
>>>>>>> [root@ijc-mon1 ~]# ceph config-key get mgr/cephadm/osd_remove_queue|jq
>>>>>>> [
>>>>>>>  {
>>>>>>>    "osd_id": 253,
>>>>>>>    "started": true,
>>>>>>>    "draining": false,
>>>>>>>    "stopped": false,
>>>>>>>    "replace": true,
>>>>>>>    "force": false,
>>>>>>>    "zap": true,
>>>>>>>    "hostname": "dig-osd4",
>>>>>>>    "drain_started_at": null,
>>>>>>>    "drain_stopped_at": null,
>>>>>>>    "drain_done_at": "2025-04-15T14:09:30.521534Z",
>>>>>>>    "process_started_at": "2025-04-15T14:09:14.091592Z"
>>>>>>>  },
>>>>>>>  {
>>>>>>>    "osd_id": 381,
>>>>>>>    "started": true,
>>>>>>>    "draining": false,
>>>>>>>    "stopped": false,
>>>>>>>    "replace": true,
>>>>>>>    "force": false,
>>>>>>>    "zap": false,
>>>>>>>    "hostname": "dig-osd6",
>>>>>>>    "drain_started_at": "2025-04-23T11:56:09.864724Z",
>>>>>>>    "drain_stopped_at": null,
>>>>>>>    "drain_done_at": "2025-04-25T06:53:03.678729Z",
>>>>>>>    "process_started_at": "2025-04-23T11:56:05.924923Z"
>>>>>>>  }
>>>>>>> ]
>>>>>>>
>>>>>>> It is not empty; the two stray daemons are listed. Not sure if these
>>>>>>> entries are expected, as I specified --replace... A similar issue was
>>>>>>> reported in https://tracker.ceph.com/issues/67018, so before Reef, but
>>>>>>> the cause may be different. It is still not clear to me how to get out
>>>>>>> of this, except maybe replacing the OSDs, but this will take some time...
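As an aside, the queue dump above can be checked mechanically. A minimal Python sketch; the "stale entry" heuristic below is an assumption, not official cephadm logic:

```python
import json

# Trimmed copy of the mgr/cephadm/osd_remove_queue dump above,
# keeping only the fields used below.
QUEUE_JSON = """
[
  {"osd_id": 253, "replace": true, "zap": true,
   "hostname": "dig-osd4", "drain_done_at": "2025-04-15T14:09:30.521534Z"},
  {"osd_id": 381, "replace": true, "zap": false,
   "hostname": "dig-osd6", "drain_done_at": "2025-04-25T06:53:03.678729Z"}
]
"""

def stale_entries(queue):
    """Entries that finished draining but still sit in the queue.

    Heuristic only (an assumption, not official cephadm logic): a
    drained, replace-flagged OSD should leave the queue once processed,
    so anything matching this deserves a closer look.
    """
    return [e for e in queue if e.get("drain_done_at") and e.get("replace")]

for e in stale_entries(json.loads(QUEUE_JSON)):
    zap = "pending" if e["zap"] else "not requested"
    print(f"osd.{e['osd_id']} on {e['hostname']}: drained, zap {zap}")
```

Note that only osd.253 has "zap": true, which matches the log output: the "Zapping devices" message appears for osd.253 only, not for osd.381.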
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Michel
>>>>>>>
>>>>>>> On 27/04/2025 at 10:21, Eugen Block wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> what's the current ceph status? Wasn't there a bug in early Reef
>>>>>>>> versions preventing upgrades if there were removed OSDs in the
>>>>>>>> queue? But IIRC, the cephadm module would crash. Can you check
>>>>>>>>
>>>>>>>> ceph config-key get mgr/cephadm/osd_remove_queue
>>>>>>>>
>>>>>>>> And then I would check the mgr log, maybe set it to a higher debug
>>>>>>>> level to see what's blocking it.
>>>>>>>>
>>>>>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I tried to restart all the mgrs (we have 3: 1 active, 2 standby)
>>>>>>>>> by executing `ceph mgr fail` 3 times, with no impact. I don't
>>>>>>>>> really understand why I get these stray daemons after doing
>>>>>>>>> `ceph orch osd rm --replace`, but I think I have always seen this.
>>>>>>>>> I tried to mute rather than disable the stray daemon check, but it
>>>>>>>>> doesn't help either. And I find it strange that there is this
>>>>>>>>> message every 10s about one of the destroyed OSDs, and only one,
>>>>>>>>> reporting that it is down and already destroyed and saying it'll
>>>>>>>>> zap it (I think I didn't add --zap when I removed it, as the
>>>>>>>>> underlying disk is dead).
>>>>>>>>>
>>>>>>>>> I'm completely stuck with this upgrade, and I don't remember having
>>>>>>>>> this kind of problem in previous upgrades with cephadm... Any
>>>>>>>>> idea where to look for the cause and/or how to fix it?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> Michel
>>>>>>>>>
>>>>>>>>> On 24/04/2025 at 23:34, Michel Jouvin wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6,
>>>>>>>>>> using 'ceph orch upgrade'. When I enter the command 'ceph orch
>>>>>>>>>> upgrade start --ceph-version 18.2.6', I receive a message saying
>>>>>>>>>> that the upgrade has been initiated, with a similar message in
>>>>>>>>>> the logs but nothing happens after this. 'ceph orch upgrade
>>>>>>>>>> status' says:
>>>>>>>>>>
>>>>>>>>>> -------
>>>>>>>>>>
>>>>>>>>>> [root@ijc-mon1 ~]# ceph orch upgrade status
>>>>>>>>>> {
>>>>>>>>>>    "target_image": "quay.io/ceph/ceph:v18.2.6",
>>>>>>>>>>    "in_progress": true,
>>>>>>>>>>    "which": "Upgrading all daemon types on all hosts",
>>>>>>>>>>    "services_complete": [],
>>>>>>>>>>    "progress": "",
>>>>>>>>>>    "message": "",
>>>>>>>>>>    "is_paused": false
>>>>>>>>>> }
>>>>>>>>>> -------
>>>>>>>>>>
>>>>>>>>>> The first time I entered the command, the cluster status was
>>>>>>>>>> HEALTH_WARN because of 2 stray daemons (caused by destroyed OSDs,
>>>>>>>>>> rm --replace). I set mgr/cephadm/warn_on_stray_daemons to false
>>>>>>>>>> to ignore these 2 daemons; the cluster is now HEALTH_OK, but it
>>>>>>>>>> doesn't help. Following a Red Hat KB entry, I tried to fail over
>>>>>>>>>> the mgr, then stopped and restarted the upgrade, but without any
>>>>>>>>>> improvement. I have not seen anything in the logs, except that
>>>>>>>>>> there is an INF entry every 10s about the destroyed OSD saying:
>>>>>>>>>>
>>>>>>>>>> ------
>>>>>>>>>>
>>>>>>>>>> 2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>> (mgr.55376028) 14079 : cephadm [INF] osd.253 now down
>>>>>>>>>> 2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>> (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on dig-osd4
>>>>>>>>>> was already removed
>>>>>>>>>> 2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>> (mgr.55376028) 14081 : cephadm [INF] Successfully destroyed old
>>>>>>>>>> osd.253 on dig-osd4; ready for replacement
>>>>>>>>>> 2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>> (mgr.55376028) 14082 : cephadm [INF] Zapping devices for osd.253
>>>>>>>>>> on dig-osd4
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>> The message seems to be for only one of the 2 destroyed OSDs
>>>>>>>>>> since I restarted the mgr. Could this be the cause of the stuck
>>>>>>>>>> upgrade? What can I do to fix this?
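For reference, a status like the one above can be spotted programmatically. A minimal Python sketch; the "stalled" heuristic is an assumption on my part, not an official cephadm check:

```python
import json

# Output of `ceph orch upgrade status` as pasted above.
STATUS = """
{
  "target_image": "quay.io/ceph/ceph:v18.2.6",
  "in_progress": true,
  "which": "Upgrading all daemon types on all hosts",
  "services_complete": [],
  "progress": "",
  "message": "",
  "is_paused": false
}
"""

def looks_stalled(status: dict) -> bool:
    """Heuristic (an assumption, not an official cephadm check): an
    upgrade that claims to be in progress but reports no progress, no
    completed services, and no message has probably not actually started."""
    return (status["in_progress"]
            and not status["is_paused"]
            and not status["progress"]
            and not status["services_complete"]
            and not status["message"])

print(looks_stalled(json.loads(STATUS)))  # prints: True
```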
>>>>>>>>>>
>>>>>>>>>> Thanks in advance for any hint. Best regards,
>>>>>>>>>>
>>>>>>>>>> Michel
>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>>>>>>
>>>>>>>>
>>>>
>>>>
> 