Hi,

I tried to restart all the mgrs (we have 3: 1 active, 2 standby) by running `ceph mgr fail` three times, with no effect. I don't really understand why I get these stray daemons after a `ceph orch osd rm --replace`, but I think I have always seen this. I tried muting rather than disabling the stray daemon check, but that doesn't help either. I also find it strange that every 10s there is a message about one (and only one) of the destroyed OSDs, reporting that it is down and already destroyed and saying it will zap it (I don't think I added --zap when I removed it, as the underlying disk is dead).
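Concretely, what I did amounts to roughly the following (a sketch from memory; CEPHADM_STRAY_DAEMON is my assumption of the health-check code that was being raised):

```shell
# Fail over the active mgr; repeated 3 times so each of the
# 3 mgrs (1 active, 2 standby) ends up restarted
ceph mgr fail

# Mute the stray-daemon health check instead of disabling it
ceph health mute CEPHADM_STRAY_DAEMON
```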

I'm completely stuck with this upgrade, and I don't remember having this kind of problem in previous upgrades with cephadm... Any idea where to look for the cause and/or how to fix it?

Best regards,

Michel

Le 24/04/2025 à 23:34, Michel Jouvin a écrit :
Hi,

I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6, using 'ceph orch upgrade'. When I enter the command 'ceph orch upgrade start --ceph-version 18.2.6', I receive a message saying that the upgrade has been initiated, with a similar message in the logs but nothing happens after this. 'ceph orch upgrade status' says:

-------

[root@ijc-mon1 ~]# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v18.2.6",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "",
    "is_paused": false
}
-------

The first time I entered the command, the cluster status was HEALTH_WARN because of 2 stray daemons (caused by OSDs destroyed with `ceph orch osd rm --replace`). I set mgr/cephadm/warn_on_stray_daemons to false to ignore these 2 daemons; the cluster is now HEALTH_OK, but it doesn't help. Following a Red Hat KB entry, I tried to fail over the mgr, then stopped and restarted the upgrade, but without any improvement. I have not seen anything in the logs, except an INF entry every 10s about the destroyed OSD saying:

------

2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14079 : cephadm [INF] osd.253 now down
2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on dig-osd4 was already removed
2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14081 : cephadm [INF] Successfully destroyed old osd.253 on dig-osd4; ready for replacement
2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14082 : cephadm [INF] Zapping devices for osd.253 on dig-osd4
-----
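For completeness, the steps I took before seeing these log entries were roughly (a sketch from memory, not exact command history):

```shell
# Hide the stray-daemon warning so the cluster reports HEALTH_OK
ceph config set mgr mgr/cephadm/warn_on_stray_daemons false

# Fail over to a standby mgr, as suggested by the Red Hat KB entry
ceph mgr fail

# Stop and restart the stuck upgrade
ceph orch upgrade stop
ceph orch upgrade start --ceph-version 18.2.6
```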

The message has concerned only one of the 2 destroyed OSDs since I restarted the mgr. Could this be the cause of the stuck upgrade? What can I do to fix it?

Thanks in advance for any hint. Best regards,

Michel

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io