This seems to have worked to get the orch back up and put me back on 16.2.15. Thank you. I'm debating whether to wait for 18.2.5 before moving forward.
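To double-check that everything is actually back on 16.2.15, the standard version listing plus the orch ps command from earlier in this thread should confirm it:

$ ceph versions
$ ceph orch ps --refresh | grep mgr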
-jeremy

> On Monday, Apr 07, 2025 at 1:26 AM, Eugen Block <ebl...@nde.ag> wrote:
>
> Still no, just edit the unit.run file for the MGRs to use a different
> image. See Frédéric's instructions (now that I'm re-reading them,
> there's a little mistake with dots and hyphens):
>
> # Back up the unit.run file
> $ cp /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run{,.bak}
>
> # Change the container image's signature. You can get the signature of
> # the version you want to reach from
> # https://quay.io/repository/ceph/ceph?tab=tags; it's in the URL of a
> # version.
> $ sed -i 's/ceph@sha256:e40c19cd70e047d14d70f5ec3cf501da081395a670cd59ca881ff56119660c8f/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9/g' /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run
>
> # Restart the container (systemctl daemon-reload not needed)
> $ systemctl restart ceph-$(ceph fsid)@mgr.ceph01.eydqvm.service
>
> # Run this command a few times and it should show the new version
> $ ceph orch ps --refresh --hostname ceph01 | grep mgr
>
> To get the image signature, you can also look into the other unit.run
> files; a version tag would also work.
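> For example, something like this should list the digests currently in
> use (a rough sketch, assuming the default /var/lib/ceph layout):
>
> $ grep -ho 'ceph@sha256:[0-9a-f]*' /var/lib/ceph/$(ceph fsid)/*/unit.run | sort -u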
> It depends on how often you need the orchestrator to maintain the
> cluster. If you have the time, you could wait a bit longer for other
> responses. If you need the orchestrator in the meantime, you can roll
> back the MGRs.
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/
>
> Quoting Jeremy Hansen <jer...@skidrow.la>:
>
> > Thank you. The only thing I'm unclear on is the rollback to Pacific.
> >
> > Are you referring to
> >
> > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
> >
> > Thank you. I appreciate all the help. Should I wait for Adam to
> > comment? At the moment, the cluster is functioning well enough to
> > keep the running VMs up, so if it's wise to wait, I can do that.
> >
> > -jeremy
> >
> > > On Monday, Apr 07, 2025 at 12:23 AM, Eugen Block <ebl...@nde.ag> wrote:
> > >
> > > I haven't tried it this way yet, and I had hoped that Adam would chime
> > > in, but my approach would be to remove this key (it's not present when
> > > no upgrade is in progress):
> > >
> > > ceph config-key rm mgr/cephadm/upgrade_state
> > >
> > > Then roll back the two newer MGRs to Pacific as described before. If
> > > they come up healthy, test whether the orchestrator works properly
> > > first. For example, remove a node-exporter or a crash daemon or
> > > anything else uncritical and let it redeploy.
> > > If that works, try a staggered upgrade, starting with the MGRs only:
> > >
> > > ceph orch upgrade start --image <image-name> --daemon-types mgr
> > >
> > > Since there's no need to go to Quincy, I suggest upgrading to Reef
> > > 18.2.4 (or waiting until 18.2.5 is released, which should be very
> > > soon), and setting the respective <image-name> in the above command.
> > >
> > > If all three MGRs upgrade successfully, you can continue with the
> > > MONs, or with the entire rest.
> > >
> > > In production clusters, I usually do staggered upgrades, e.g. I limit
> > > the number of OSD daemons first just to see if they come up healthy,
> > > then I let it upgrade all the other OSDs automatically; see the
> > > example below the link.
> > >
> > > https://docs.ceph.com/en/latest/cephadm/upgrade/#staggered-upgrade
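> > > To illustrate that OSD-limiting step, something like this should
> > > upgrade only a handful of OSDs first (the image tag is just an
> > > example; --daemon-types and --limit are the parameters described in
> > > the staggered upgrade docs):
> > >
> > > ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types osd --limit 3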
> > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > >
> > > > Snipped some of the irrelevant logs to keep the message size down.
> > > >
> > > > ceph config-key get mgr/cephadm/upgrade_state
> > > >
> > > > {"target_name": "quay.io/ceph/ceph:v17.2.0",
> > > > "progress_id": "e7e1a809-558d-43a7-842a-c6229fdc57af",
> > > > "target_id": "e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9",
> > > > "target_digests": ["quay.io/ceph/ceph@sha256:12a0a4f43413fd97a14a3d47a3451b2d2df50020835bb93db666209f3f77617a",
> > > > "quay.io/ceph/ceph@sha256:cb4d698cb769b6aba05bf6ef04f41a7fe694160140347576e13bd9348514b667"],
> > > > "target_version": "17.2.0", "fs_original_max_mds": null,
> > > > "fs_original_allow_standby_replay": null, "error": null,
> > > > "paused": false, "daemon_types": null, "hosts": null,
> > > > "services": null, "total_count": null, "remaining_count": null}
> > > >
> > > > What should I do next?
> > > >
> > > > Thank you!
> > > > -jeremy
> > > >
> > > > > On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block <ebl...@nde.ag> wrote:
> > > > >
> > > > > Can you check if you have this config-key?
> > > > >
> > > > > ceph config-key get mgr/cephadm/upgrade_state
> > > > >
> > > > > If you reset the MGRs, it might be necessary to clear this key,
> > > > > otherwise you might end up with some inconsistency. Just to be sure.
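> > > > > If you do end up clearing it, it might be wise to keep a copy
> > > > > first; both are the standard config-key commands already used in
> > > > > this thread:
> > > > >
> > > > > ceph config-key get mgr/cephadm/upgrade_state > upgrade_state.bak.json
> > > > > ceph config-key rm mgr/cephadm/upgrade_state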
> > > > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > > > >
> > > > > > Thanks. I'm trying to be extra careful since this cluster is
> > > > > > actually in use. I'll wait for your feedback.
> > > > > >
> > > > > > -jeremy
> > > > > >
> > > > > > > On Saturday, Apr 05, 2025 at 3:39 PM, Eugen Block <ebl...@nde.ag> wrote:
> > > > > > >
> > > > > > > No, that's not necessary, just edit the unit.run file for the
> > > > > > > MGRs to use a different image. See Frédéric's instructions:
> > > > > > >
> > > > > > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/
> > > > > > >
> > > > > > > But I'm not entirely sure if you need to clear some config-keys
> > > > > > > first in order to reset the upgrade state. If I have time, I'll
> > > > > > > try to check tomorrow, or on Monday.
> > > > > > >
> > > > > > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > > > > > >
> > > > > > > > Would I follow this process to downgrade?
> > > > > > > >
> > > > > > > > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
> > > > > > > >
> > > > > > > > Thank you
> > > > > > > >
> > > > > > > > > On Saturday, Apr 05, 2025 at 2:04 PM, Jeremy Hansen <jer...@skidrow.la> wrote:
> > > > > > > > >
> > > > > > > > > ceph -s claims things are healthy:
> > > > > > > > >
> > > > > > > > > ceph -s
> > > > > > > > >   cluster:
> > > > > > > > >     id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
> > > > > > > > >     health: HEALTH_OK
> > > > > > > > >
> > > > > > > > >   services:
> > > > > > > > >     mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
> > > > > > > > >     mgr: cn03.negzvb(active, since 26m), standbys: cn01.tjmtph, cn02.ceph.xyz.corp.ggixgj
> > > > > > > > >     mds: 1/1 daemons up, 2 standby
> > > > > > > > >     osd: 15 osds: 15 up (since 19h), 15 in (since 14M)
> > > > > > > > >
> > > > > > > > >   data:
> > > > > > > > >     volumes: 1/1 healthy
> > > > > > > > >     pools:   6 pools, 610 pgs
> > > > > > > > >     objects: 284.59k objects, 1.1 TiB
> > > > > > > > >     usage:   3.3 TiB used, 106 TiB / 109 TiB avail
> > > > > > > > >     pgs:     610 active+clean
> > > > > > > > >
> > > > > > > > >   io:
> > > > > > > > >     client: 255 B/s rd, 1.2 MiB/s wr, 10 op/s rd, 16 op/s wr
> > > > > > > > >
> > > > > > > > > —
> > > > > > > > > How do I downgrade if the orch is down?
> > > > > > > > >
> > > > > > > > > Thank you
> > > > > > > > > -jeremy
> > > > > > > > >
> > > > > > > > > > On Saturday, Apr 05, 2025 at 1:56 PM, Eugen Block <ebl...@nde.ag> wrote:
> > > > > > > > > >
> > > > > > > > > > It would help if you only pasted the relevant parts.
> > > > > > > > > > Anyway, these two sections stand out:
> > > > > > > > > >
> > > > > > > > > > ---snip---
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.909+0000 7f26f0200700  0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Failed to construct class in 'cephadm'
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 470, in __init__
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:     self.upgrade = CephadmUpgrade(self)
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:   File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112, in __init__
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:     self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:   File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93, in from_json
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:     return cls(**c)
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: TypeError: __init__() got an unexpected keyword argument 'daemon_types'
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.918+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('cephadm')
> > > > > > > > > >
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Failed to construct class in 'snap_schedule'
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:   File "/usr/share/ceph/mgr/snap_schedule/module.py", line 38, in __init__
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:     self.client = SnapSchedClient(self)
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:   File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 158, in __init__
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:     with self.get_schedule_db(fs_name) as conn_mgr:
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:   File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 192, in get_schedule_db
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:     db.executescript(dump)
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: sqlite3.OperationalError: table schedules already exists
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.274+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('snap_schedule')
> > > > > > > > > > ---snip---
> > > > > > > > > >
> > > > > > > > > > Your cluster seems to be in an error state (ceph -s) because of an
> > > > > > > > > > unknown PG. It's recommended to have a healthy cluster before
> > > > > > > > > > attempting an upgrade. It's possible that these errors come from
> > > > > > > > > > the not yet upgraded MGR; I'm not sure.
> > > > > > > > > >
> > > > > > > > > > Since the upgrade was only successful for two MGRs, I am thinking
> > > > > > > > > > about downgrading both MGRs back to 16.2.15, then retrying an
> > > > > > > > > > upgrade to a newer version, either 17.2.8 or 18.2.4. I haven't
> > > > > > > > > > checked the snap_schedule error yet, though. Maybe someone else
> > > > > > > > > > knows that already.
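> > > > > > > > > > To track down the unknown PG before retrying, something like this
> > > > > > > > > > should show it (ceph health detail is standard; filtering ceph pg
> > > > > > > > > > ls by state is an assumption about your release):
> > > > > > > > > >
> > > > > > > > > > ceph health detail
> > > > > > > > > > ceph pg ls unknown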