Hi,
I tried to restart all the mgrs (we have 3: 1 active, 2 standby) by
running `ceph mgr fail` three times, with no effect. I don't really
understand why I get these stray daemons after doing a `ceph orch osd rm
--replace`, but I think I have always seen this. I also tried muting the
stray daemon check rather than disabling it, but that doesn't help
either. And I find it strange that a message appears every 10s about one
of the destroyed OSDs, and only one, reporting that it is down and
already destroyed and saying that it will be zapped (I don't think I
added --zap when I removed it, as the underlying disk is dead).
I'm completely stuck with this upgrade, and I don't remember having this
kind of problem in previous upgrades with cephadm. Any idea where to
look for the cause and/or how to fix it?
Best regards,
Michel
On 24/04/2025 at 23:34, Michel Jouvin wrote:
Hi,
I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6, using
'ceph orch upgrade'. When I enter the command 'ceph orch upgrade start
--ceph-version 18.2.6', I receive a message saying that the upgrade
has been initiated, with a similar message in the logs but nothing
happens after this. 'ceph orch upgrade status' says:
-------
[root@ijc-mon1 ~]# ceph orch upgrade status
{
"target_image": "quay.io/ceph/ceph:v18.2.6",
"in_progress": true,
"which": "Upgrading all daemon types on all hosts",
"services_complete": [],
"progress": "",
"message": "",
"is_paused": false
}
-------
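For anyone who wants to check for the same symptom in a script, here is a minimal Python sketch. It assumes the JSON printed by 'ceph orch upgrade status' has been captured as a string; the function name and the "stalled" heuristic (in progress, not paused, but no progress, message or completed services) are illustrative assumptions, not anything cephadm defines.

```python
import json


def upgrade_looks_stalled(status_json: str) -> bool:
    """Heuristic (an assumption, not a cephadm rule): the upgrade is flagged
    as in progress and not paused, yet reports nothing at all."""
    status = json.loads(status_json)
    return (
        status.get("in_progress") is True
        and not status.get("is_paused", False)
        and not status.get("services_complete")
        and not status.get("progress")
        and not status.get("message")
    )


# Status output copied from the message above.
sample = """
{
  "target_image": "quay.io/ceph/ceph:v18.2.6",
  "in_progress": true,
  "which": "Upgrading all daemon types on all hosts",
  "services_complete": [],
  "progress": "",
  "message": "",
  "is_paused": false
}
"""

print(upgrade_looks_stalled(sample))  # prints True for the status shown above
```

A single snapshot like this is also what an upgrade that has only just started would show, so the check is only meaningful if the output stays the same over several minutes.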
The first time I entered the command, the cluster status was
HEALTH_WARN because of 2 stray daemons (caused by destroyed OSDs, rm
--replace). I set mgr/cephadm/warn_on_stray_daemons to false to ignore
these 2 daemons; the cluster is now HEALTH_OK, but that doesn't help.
Following a Red Hat KB entry, I tried to fail over the mgr, then stopped
and restarted the upgrade, but without any improvement. I have not seen
anything in the logs, except that there is an INF entry every 10s
about the destroyed OSD saying:
------
2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028)
14079 : cephadm [INF] osd.253 now down
2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028)
14080 : cephadm [INF] Daemon osd.253 on dig-osd4 was already removed
2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028)
14081 : cephadm [INF] Successfully destroyed old osd.253 on dig-osd4;
ready for replacement
2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028)
14082 : cephadm [INF] Zapping devices for osd.253 on dig-osd4
-----
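To confirm which OSDs the removal loop keeps retrying, and how often, the cephadm log can be tallied with a few lines of Python. This is only a sketch: the regular expression is derived from the four log lines quoted above, and the function name is hypothetical. The input is assumed to be log text saved to a string (for example, the output of 'ceph log last cephadm').

```python
import re
from collections import Counter

# Pattern derived from the "Zapping devices" line quoted above.
ZAP_RE = re.compile(r"cephadm \[INF\] Zapping devices for (osd\.\d+) on (\S+)")


def count_zap_attempts(log_text: str) -> Counter:
    """Count 'Zapping devices' entries per (osd, host) pair."""
    # Mail clients and terminals wrap long log lines, so collapse all
    # whitespace before matching.
    flat = " ".join(log_text.split())
    return Counter(ZAP_RE.findall(flat))


# Log excerpt copied from the message above, including the line wrapping.
excerpt = """
2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028)
14079 : cephadm [INF] osd.253 now down
2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028)
14080 : cephadm [INF] Daemon osd.253 on dig-osd4 was already removed
2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028)
14081 : cephadm [INF] Successfully destroyed old osd.253 on dig-osd4;
ready for replacement
2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028)
14082 : cephadm [INF] Zapping devices for osd.253 on dig-osd4
"""

for (osd, host), n in count_zap_attempts(excerpt).items():
    print(f"{osd} on {host}: {n} zap attempt(s)")
```

Run over a longer stretch of the log, a count that keeps growing for a single OSD would match the "every 10s" pattern described here.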
Since I restarted the mgr, the message seems to appear for only one of
the 2 destroyed OSDs. Could this be the cause of the stuck upgrade? What
can I do to fix this?
Thanks in advance for any hint. Best regards,
Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io