Hi,

I did a quick search on the tracker but couldn't find anything related. A customer reported this, and I can confirm the behaviour on a lab cluster. I usually perform staggered upgrades with --daemon-types and --limit, though not for all daemon types, so I hadn't stumbled across this myself; our customer did (I can reproduce it with --daemon-types as well). They upgraded from latest Reef to latest Squid and reported that despite providing the --limit parameter, all the mons were upgraded. So I tried to reproduce it, and the behaviour is not really clear to me. I'll try to break it down.

# Start with MGRs

reef1:~ # ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.3 --services mgr --limit 1

Upgrading MGRs with --limit works, but it isn't reflected in the MGR log. Usually I'd expect a line like this:

...[cephadm INFO root] Hit upgrade limit of 1. Stopping upgrade

But there is no such line in the logs. I then upgraded the rest of the MGRs.

# Continue with MONs

reef1:~ # ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.3 --services mon --limit 1

This gets even weirder: the orchestrator upgrades 2 out of 3 MONs, and again there is no "Hit upgrade limit" line in the log. What I did notice was an MGR respawn after the first MON had been upgraded successfully. Maybe some state of the upgrade progress gets lost during the respawn? I then upgraded the remaining MON, after which ceph-crash was upgraded successfully.
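To illustrate that hypothesis with a toy model (this is purely a sketch of the suspected failure mode, not cephadm's actual implementation, and the `upgrade_with_limit` function is made up for illustration): if the --limit were tracked as an in-memory counter that gets wiped when the MGR respawns, exactly one extra daemon would slip through before the limit trips again, which would match 2 of 3 MONs being upgraded.

```python
# Toy model: a "--limit" counter that is NOT persisted across an MGR
# respawn. Hypothetical code, only meant to illustrate the suspicion
# stated above.

def upgrade_with_limit(daemons, limit, respawn_after=None):
    upgraded = []
    count = 0  # in-memory counter, lost on respawn
    for i, daemon in enumerate(daemons):
        if count >= limit:
            break  # this is where "Hit upgrade limit" would be logged
        upgraded.append(daemon)
        count += 1
        if respawn_after is not None and i == respawn_after:
            count = 0  # respawn wipes the in-memory counter
    return upgraded

mons = ["mon.a", "mon.b", "mon.c"]
print(upgrade_with_limit(mons, limit=1))                   # ['mon.a']
print(upgrade_with_limit(mons, limit=1, respawn_after=0))  # ['mon.a', 'mon.b']
```

With no respawn the limit holds; with a respawn right after the first MON, two of three get upgraded, just like in my reproduction.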

# Upgrade OSD

reef1:~ # ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.3 --services osd.osd.standalone --limit 1

And this is the first service that actually reports the upgrade limit:

2026-03-05T15:47:17.108+0000 7f5bfdc73640 0 [cephadm INFO cephadm.upgrade] Upgrade: Updating osd.0 (1/1)
2026-03-05T15:47:17.108+0000 7f5bfdc73640 0 log_channel(cephadm) log [INF] : Upgrade: Updating osd.0 (1/1)
2026-03-05T15:47:32.518+0000 7f5bfdc73640 0 [cephadm INFO root] Hit upgrade limit of 1. Stopping upgrade
2026-03-05T15:47:32.518+0000 7f5bfdc73640 0 log_channel(cephadm) log [INF] : Hit upgrade limit of 1. Stopping upgrade
2026-03-05T15:47:47.395+0000 7f5bfdc73640 0 [cephadm INFO cephadm.upgrade] Upgrade: Setting container_image for all nvmeof
2026-03-05T15:47:47.395+0000 7f5bfdc73640 0 log_channel(cephadm) log [INF] : Upgrade: Setting container_image for all nvmeof
2026-03-05T15:47:47.492+0000 7f5bfdc73640 0 [cephadm INFO cephadm.upgrade] Upgrade: Finalizing container_image settings
2026-03-05T15:47:47.493+0000 7f5bfdc73640 0 log_channel(cephadm) log [INF] : Upgrade: Finalizing container_image settings
2026-03-05T15:47:47.667+0000 7f5bfdc73640 0 [cephadm INFO cephadm.upgrade] Upgrade: Complete!
2026-03-05T15:47:47.667+0000 7f5bfdc73640 0 log_channel(cephadm) log [INF] : Upgrade: Complete!


This is really irritating and inconsistent: if the orchestrator does honour --limit for services other than OSDs, why isn't that visible in the logs? And what's going on with the MONs, why 2 out of 3?
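In case anyone wants to double-check which daemons actually got upgraded without relying on the cephadm log, the running versions can be pulled from `ceph orch ps --format json`. A minimal sketch below; the sample JSON is made up, but the `daemon_type`, `daemon_id` and `version` fields are what `ceph orch ps` reports per daemon:

```python
import json
from collections import Counter

# Hypothetical excerpt of `ceph orch ps --format json` output; on a real
# cluster, feed the actual command output in instead.
sample = json.dumps([
    {"daemon_type": "mon", "daemon_id": "reef1", "version": "19.2.3"},
    {"daemon_type": "mon", "daemon_id": "reef2", "version": "19.2.3"},
    {"daemon_type": "mon", "daemon_id": "reef3", "version": "18.2.7"},
])

def versions_by_type(orch_ps_json, daemon_type):
    """Count the running versions for one daemon type."""
    daemons = json.loads(orch_ps_json)
    return Counter(d["version"] for d in daemons
                   if d["daemon_type"] == daemon_type)

print(versions_by_type(sample, "mon"))  # Counter({'19.2.3': 2, '18.2.7': 1})
```

A mixed Counter like the one above is how my cluster looked after the MON step: two daemons on the new version despite --limit 1.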

Any pointers appreciated! Not sure which Ceph versions might be affected by this, I'll try out a couple more upgrade paths.

Thanks,
Eugen
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
