Hi,
I did a quick search in the tracker but couldn't find anything related. A
customer reported this, and I can confirm the behaviour on a lab
cluster. I usually perform staggered upgrades with --daemon-types and
--limit, but not for all daemon types, so I hadn't stumbled across
this myself; our customer did (I can reproduce it with --daemon-types
as well). They upgraded from the latest Reef to the latest Squid and
reported that despite providing the --limit parameter, all mons were
upgraded. So I tried to reproduce it, and the behaviour is not really
clear to me; I'll try to describe it step by step.
# Start with MGRs
reef1:~ # ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.3 --services mgr --limit 1
Upgrading the MGRs with --limit works, but it isn't reflected in the
MGR log. Usually I would expect a line like this:
...[cephadm INFO root] Hit upgrade limit of 1. Stopping upgrade
But there is no such line in the logs. Then I upgrade the rest of the MGRs.
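For reference, this is roughly how I check which MGRs actually got the new version. On a live cluster I'd run "ceph orch ps --daemon-type mgr"; the excerpt below is a hypothetical, trimmed sample of that output (name, host, status, version only), and the awk one-liner just counts daemons per version:

```shell
# On a live cluster: ceph orch ps --daemon-type mgr
# Hypothetical, trimmed excerpt of that output (name, host, status, version):
cat > /tmp/orch_ps_mgr_sample.txt <<'EOF'
mgr.reef1.aaaaaa  reef1  running  19.2.3
mgr.reef2.bbbbbb  reef2  running  18.2.7
mgr.reef3.cccccc  reef3  running  18.2.7
EOF
# Count daemons per version (4th column)
awk '{count[$4]++} END {for (v in count) print v, count[v]}' /tmp/orch_ps_mgr_sample.txt | sort
```

With --limit 1 honoured, exactly one mgr should show the new version, as in this sample.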
# Continue with MONs
reef1:~ # ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.3 --services mon --limit 1
This gets even weirder: the orchestrator upgrades 2 out of 3 MONs, and
again there is no such "Hit upgrade limit" line in the log. What I
noticed was a MGR respawn after the first MON had been upgraded
successfully. Maybe some state of the upgrade progress gets lost
during the respawn?
I then upgraded the remaining MON, and after that ceph-crash was
upgraded successfully.
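To double-check that the message really never appears, I grep the cephadm log channel for it (on a live cluster via "ceph log last 100 info cephadm" or the active MGR's log file). The log excerpt below is a hypothetical sample just to show the grep:

```shell
# Hypothetical excerpt of the cephadm log channel during the mon step;
# on a live cluster: ceph log last 100 info cephadm
cat > /tmp/cephadm_mon_sample.log <<'EOF'
[cephadm INFO cephadm.upgrade] Upgrade: Updating mon.reef2 (1/1)
[cephadm INFO cephadm.upgrade] Upgrade: Updating mon.reef3 (2/1)
EOF
# grep -c exits non-zero when there are no matches, hence the || true
grep -c 'Hit upgrade limit' /tmp/cephadm_mon_sample.log || true
```

A count of 0 here means the limit message was never logged for the mon step, which matches what I saw.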
# Upgrade OSD
reef1:~ # ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.3 --services osd.osd.standalone --limit 1
And this is the first service that actually reports the upgrade limit:
2026-03-05T15:47:17.108+0000 7f5bfdc73640 0 [cephadm INFO
cephadm.upgrade] Upgrade: Updating osd.0 (1/1)
2026-03-05T15:47:17.108+0000 7f5bfdc73640 0 log_channel(cephadm) log
[INF] : Upgrade: Updating osd.0 (1/1)
2026-03-05T15:47:32.518+0000 7f5bfdc73640 0 [cephadm INFO root] Hit
upgrade limit of 1. Stopping upgrade
2026-03-05T15:47:32.518+0000 7f5bfdc73640 0 log_channel(cephadm) log
[INF] : Hit upgrade limit of 1. Stopping upgrade
2026-03-05T15:47:47.395+0000 7f5bfdc73640 0 [cephadm INFO
cephadm.upgrade] Upgrade: Setting container_image for all nvmeof
2026-03-05T15:47:47.395+0000 7f5bfdc73640 0 log_channel(cephadm) log
[INF] : Upgrade: Setting container_image for all nvmeof
2026-03-05T15:47:47.492+0000 7f5bfdc73640 0 [cephadm INFO
cephadm.upgrade] Upgrade: Finalizing container_image settings
2026-03-05T15:47:47.493+0000 7f5bfdc73640 0 log_channel(cephadm) log
[INF] : Upgrade: Finalizing container_image settings
2026-03-05T15:47:47.667+0000 7f5bfdc73640 0 [cephadm INFO
cephadm.upgrade] Upgrade: Complete!
2026-03-05T15:47:47.667+0000 7f5bfdc73640 0 log_channel(cephadm) log
[INF] : Upgrade: Complete!
This is really irritating and inconsistent: if the orchestrator does
honor --limit for services other than OSDs, why isn't that visible in
the logs? And what about the MONs? Why 2 out of 3?
Any pointers appreciated! I'm not sure which Ceph versions might be
affected by this; I'll try out a couple more upgrade paths.
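When trying other upgrade paths I'll compare daemon counts per release with "ceph versions", which prints one count per release string. The JSON below is a hypothetical, trimmed excerpt (hashes elided) matching what I saw on the lab cluster (2 of 3 mons upgraded):

```shell
# On a live cluster: ceph versions
# Hypothetical, trimmed excerpt (hashes elided), matching 2 of 3 mons upgraded:
cat > /tmp/ceph_versions_sample.json <<'EOF'
{
    "mon": {
        "ceph version 18.2.7 (...) reef (stable)": 1,
        "ceph version 19.2.3 (...) squid (stable)": 2
    }
}
EOF
# Number of distinct releases among the mons; with --limit 1 honoured
# I'd still expect two entries here, but with counts 2 and 1.
grep -c 'ceph version' /tmp/ceph_versions_sample.json
```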
Thanks,
Eugen
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]