[ceph-users] Re: Cephadm upgrade from 16.2.15 -> 17.2.0

Eugen Block Mon, 07 Apr 2025 00:25:20 -0700

I haven't tried it this way yet, and I had hoped that Adam would chimein, but my approach would be to remove this key (it's not present whenno upgrade is in progress):


ceph config-key rm mgr/cephadm/upgrade_state

Then rollback the two newer MGRs to Pacific as described before. Ifthey come up healthy, test if the orchestrator works properly first.For example, remove a node-exporter or crash or anything elseuncritical and let it redeploy.

If that works, try a staggered upgrade, starting with the MGRs only:

ceph orch upgrade start --image <image-name> --daemon-types mgr

Since there's no need to go to Quincy, I suggest to upgrade to Reef18.2.4 (or you wait until 18.2.5 is released, which should be verysoon), so set the respective <image-name> in the above command.

If all three MGRs successfully upgrade, you can continue with theMONs, or with the entire rest.

In production clusters, I usually do staggered upgrades, e. g. I limitthe number of OSD daemons first just to see if they come up healthy,then I let it upgrade all other OSDs automatically.


https://docs.ceph.com/en/latest/cephadm/upgrade/#staggered-upgrade

Zitat von Jeremy Hansen <jer...@skidrow.la>:

Snipped some of the irrelevant logs to keep message size down.

ceph config-key get mgr/cephadm/upgrade_state

{"target_name": "quay.io/ceph/ceph:v17.2.0", "progress_id":"e7e1a809-558d-43a7-842a-c6229fdc57af", "target_id":"e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9","target_digests":["quay.io/ceph/ceph@sha256:12a0a4f43413fd97a14a3d47a3451b2d2df50020835bb93db666209f3f77617a", "quay.io/ceph/ceph@sha256:cb4d698cb769b6aba05bf6ef04f41a7fe694160140347576e13bd9348514b667"], "target_version": "17.2.0", "fs_original_max_mds": null, "fs_original_allow_standby_replay": null, "error": null, "paused": false, "daemon_types": null, "hosts": null, "services": null, "total_count": null, "remaining_count":null}


What should I do next?

Thank you!
-jeremy

On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block <ebl...@nde.ag(mailto:ebl...@nde.ag)> wrote:

Can you check if you have this config-key?

ceph config-key get mgr/cephadm/upgrade_state

If you reset the MGRs, it might be necessary to clear this key,
otherwise you might end up in some inconsistency. Just to be sure.

Zitat von Jeremy Hansen <jer...@skidrow.la>:

> Thanks. I’m trying to be extra careful since this cluster is
> actually in use. I’ll wait for your feedback.
>
> -jeremy
>
> > On Saturday, Apr 05, 2025 at 3:39 PM, Eugen Block <ebl...@nde.ag
> > (mailto:ebl...@nde.ag)> wrote:
> > No, that's not necessary, just edit the unit.run file for the MGRs to
> > use a different image. See Frédéric's instructions:
> >

> >https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/

> >
> > But I'm not entirely sure if you need to clear some config-keys first
> > in order to reset the upgrade state. If I have time, I'll try to check
> > tomorrow, or on Monday.
> >
> > Zitat von Jeremy Hansen <jer...@skidrow.la>:
> >
> > > Would I follow this process to downgrade?
> > >
> > >

> >https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon

> > >
> > > Thank you
> > >
> > > > On Saturday, Apr 05, 2025 at 2:04 PM, Jeremy Hansen
> > > > <jer...@skidrow.la (mailto:jer...@skidrow.la)> wrote:
> > > > ceph -s claims things are healthy:
> > > >
> > > > ceph -s
> > > > cluster:
> > > > id: 95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
> > > > health: HEALTH_OK
> > > >
> > > > services:
> > > > mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
> > > > mgr: cn03.negzvb(active, since 26m), standbys: cn01.tjmtph,
> > > > cn02.ceph.xyz.corp.ggixgj
> > > > mds: 1/1 daemons up, 2 standby
> > > > osd: 15 osds: 15 up (since 19h), 15 in (since 14M)
> > > >
> > > > data:
> > > > volumes: 1/1 healthy
> > > > pools: 6 pools, 610 pgs
> > > > objects: 284.59k objects, 1.1 TiB
> > > > usage: 3.3 TiB used, 106 TiB / 109 TiB avail
> > > > pgs: 610 active+clean
> > > >
> > > > io:
> > > > client: 255 B/s rd, 1.2 MiB/s wr, 10 op/s rd, 16 op/s wr
> > > >
> > > >
> > > >
> > > > —
> > > > How do I downgrade if the orch is down?
> > > >
> > > > Thank you
> > > > -jeremy
> > > >
> > > >
> > > >
> > > > > On Saturday, Apr 05, 2025 at 1:56 PM, Eugen Block <ebl...@nde.ag
> > > > (mailto:ebl...@nde.ag)> wrote:

> > > > > It would help if you only pasted the relevant parts.Anyway, these two

> > > > > sections stand out:
> > > > >
> > > > > ---snip---
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > debug 2025-04-05T20:33:48.909+0000 7f26f0200700 0[balancer INFO root]

> > > > > Some PGs (1.000000) are unknown; try again later
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgrload Failed to

> > > > > construct class in 'cephadm'
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgrload Traceback

> > > > > (most recent call last):
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > File "/usr/share/ceph/mgr/cephadm/module.py", line 470,in __init__

> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > self.upgrade = CephadmUpgrade(self)
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112,in __init__

> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > self.upgrade_state: Optional[UpgradeState] =
> > > > > UpgradeState.from_json(json.loads(t))
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93,in from_json

> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > return cls(**c)
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > TypeError: __init__() got an unexpected keyword argument
> > 'daemon_types'
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > debug 2025-04-05T20:33:48.918+0000 7f2663400700 -1 mgr operator()
> > > > > Failed to run module in active mode ('cephadm')
> > > > >
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgrload Failed to

> > > > > construct class in 'snap_schedule'
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgrload Traceback

> > > > > (most recent call last):
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > File "/usr/share/ceph/mgr/snap_schedule/module.py", line 38,
> > in __init__
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > self.client = SnapSchedClient(self)
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > File"/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line

> > > > > 158, in __init__
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > with self.get_schedule_db(fs_name) as conn_mgr:
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:

> > > > > File"/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line

> > > > > 192, in get_schedule_db
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > db.executescript(dump)
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > sqlite3.OperationalError: table schedules already exists
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > debug 2025-04-05T20:33:49.274+0000 7f2663400700 -1 mgr operator()
> > > > > Failed to run module in active mode ('snap_schedule')
> > > > > ---snip---
> > > > >
> > > > > Your cluster seems to be in an error state (ceph -s) because of an
> > > > > unknown PG. It's recommended to have a healthy cluster before

> > > > > attemping an upgrade. It's possible that these errorscome from the

> > > > > not upgraded MGR, I'm not sure.
> > > > >
> > > > > Since the upgrade was only successful for two MGRs, I am thinking

> > > > > about downgrading both MGRs back to 16.2.15, then retryan upgrade to

> > > > > a newer version, either 17.2.8 or 18.2.4. I haven't checked the
> > > > > snap_schedule error yet, though. Maybe someone else knows
> > that already.
> >
> >



_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Cephadm upgrade from 16.2.15 -> 17.2.0

Reply via email to