Thanks. I’ll wait. I need this to go smoothly on another cluster that has to go 
through the same process.

-jeremy

> On Monday, Apr 14, 2025 at 12:10 AM, Eugen Block <ebl...@nde.ag 
> (mailto:ebl...@nde.ag)> wrote:
> Ah, this looks like the encryption issue which seems new in 18.2.5,
> brought up here:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UJ4DREAWNBBVVUJXYVZO25AYVQ5RLT42/
>
> In that case it's questionable whether you really want to upgrade to
> 18.2.5. Maybe 18.2.4 would be more suitable, although it's missing bug
> fixes from .5 (like the RGW memory leak fix). If you really need to
> upgrade, I guess I would go with .4, otherwise stay on Pacific until
> this issue has been addressed. It's not an easy decision. ;-)
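>
> If you do go with .4, a staggered upgrade pinned to that image could
> look like this (just a sketch; verify the tag before running it):
>
> ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types mgr
> # once the MGRs look healthy, continue with the rest:
> ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4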
>
> Quoting Jeremy Hansen <jer...@skidrow.la>:
>
> > I haven’t attempted the remaining upgrade just yet. I wanted to
> > check on this before proceeding. Things seem “stable” in the sense
> > that I’m running VMs and all volumes and images are still
> > functioning. I’m using whatever would have been the default from
> > 16.2.14. The warning comes and goes: I receive Nagios alerts, which
> > eventually clear and then reappear.
> >
> > HEALTH_WARN Failed to apply 1 service(s): osd.cost_capacity
> > [WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s):
> > osd.cost_capacity
> > osd.cost_capacity: cephadm exited with an error code: 1,
> > stderr:Inferring config
> > /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/mon.cn02/config
> > Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host
> > --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
> > --privileged --group-add=disk --init -e
> > CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c
> >  -e NODE_NAME=cn02.ceph.xyz.corp -e 
> > CEPH_VOLUME_OSDSPEC_AFFINITY=cost_capacity -e 
> > CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v 
> > /var/run/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/run/ceph:z -v 
> > /var/log/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/log/ceph:z -v 
> > /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/crash:/var/lib/ceph/crash:z
> >  -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v 
> > /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v 
> > /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v 
> > /tmp/ceph-tmp49jj8zoh:/etc/ceph/ceph.conf:z -v 
> > /tmp/ceph-tmp_9k8v5uj:/var/lib/ceph/bootstrap-osd/ceph.keyring:z 
> > quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c
> >  lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf --dmcrypt 
> > --yes
> > --no-systemd
> > /usr/bin/podman: stderr Traceback (most recent call last):
> > /usr/bin/podman: stderr File "/usr/sbin/ceph-volume", line 33, in <module>
> > /usr/bin/podman: stderr sys.exit(load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')())
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 54, in __init__
> > /usr/bin/podman: stderr self.main(self.argv)
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/decorators.py", line 59, in newfunc
> > /usr/bin/podman: stderr return f(*a, **kw)
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 166, in main
> > /usr/bin/podman: stderr terminal.dispatch(self.mapper, subcommand_args)
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line 194, in dispatch
> > /usr/bin/podman: stderr instance.main()
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
> > /usr/bin/podman: stderr terminal.dispatch(self.mapper, self.argv)
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line 192, in dispatch
> > /usr/bin/podman: stderr instance = mapper.get(arg)(argv[count:])
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/batch.py", line 325, in __init__
> > /usr/bin/podman: stderr self.args = parser.parse_args(argv)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 1825, in parse_args
> > /usr/bin/podman: stderr args, argv = self.parse_known_args(args, namespace)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 1858, in parse_known_args
> > /usr/bin/podman: stderr namespace, args = self._parse_known_args(args, namespace)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 2067, in _parse_known_args
> > /usr/bin/podman: stderr start_index = consume_optional(start_index)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 2007, in consume_optional
> > /usr/bin/podman: stderr take_action(action, args, option_string)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py", line 1935, in take_action
> > /usr/bin/podman: stderr action(self, namespace, argument_values, option_string)
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/util/arg_validators.py", line 17, in __call__
> > /usr/bin/podman: stderr set_dmcrypt_no_workqueue()
> > /usr/bin/podman: stderr File "/usr/lib/python3.9/site-packages/ceph_volume/util/encryption.py", line 54, in set_dmcrypt_no_workqueue
> > /usr/bin/podman: stderr raise RuntimeError('Error while checking cryptsetup version.\n',
> > /usr/bin/podman: stderr RuntimeError: ('Error while checking cryptsetup version.\n', '`cryptsetup --version` output:\n', 'cryptsetup 2.7.2 flags: UDEV BLKID KEYRING FIPS KERNEL_CAPI PWQUALITY ')
> > Traceback (most recent call last):
> > File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
> > return _run_code(code, main_globals, None,
> > File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
> > exec(code, run_globals)
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 11009, in <module>
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 10997, in main
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2593, in _infer_config
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2509, in _infer_fsid
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2621, in _infer_image
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2496, in _validate_fsid
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 7226, in command_ceph_volume
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2284, in call_throws
> > RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host
> > --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
> > --privileged --group-add=disk --init -e
> > CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c
> >  -e NODE_NAME=cn02.ceph.xyz.corp -e 
> > CEPH_VOLUME_OSDSPEC_AFFINITY=cost_capacity -e 
> > CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v 
> > /var/run/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/run/ceph:z -v 
> > /var/log/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/log/ceph:z -v 
> > /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/crash:/var/lib/ceph/crash:z
> >  -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v 
> > /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v 
> > /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v 
> > /tmp/ceph-tmp49jj8zoh:/etc/ceph/ceph.conf:z -v 
> > /tmp/ceph-tmp_9k8v5uj:/var/lib/ceph/bootstrap-osd/ceph.keyring:z 
> > quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c
> >  lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf --dmcrypt 
> > --yes
> > --no-systemd
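> >
> > For context, the RuntimeError above is raised while ceph-volume parses
> > the output of `cryptsetup --version`; cryptsetup 2.7.x appends a
> > "flags: ..." suffix to the version line, which the 18.2.5 check
> > apparently doesn't expect. As an illustration only (not the actual
> > ceph-volume code), a tolerant extraction of the version would be:
> >
> > cryptsetup --version | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1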
> >
> > —
> >
> > ceph orch ls osd --export
> > service_type: osd
> > service_id: all-available-devices
> > service_name: osd.all-available-devices
> > placement:
> >   host_pattern: '*'
> > spec:
> >   data_devices:
> >     all: true
> >   filter_logic: AND
> >   objectstore: bluestore
> > ---
> > service_type: osd
> > service_id: cost_capacity
> > service_name: osd.cost_capacity
> > placement:
> >   host_pattern: '*'
> > spec:
> >   data_devices:
> >     rotational: 1
> >   encrypted: true
> >   filter_logic: AND
> >   objectstore: bluestore
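> >
> > If it helps in the meantime, one way to stop cephadm from retrying the
> > failing spec without touching the deployed OSDs might be to mark it
> > unmanaged (untested on this cluster):
> >
> > ceph orch ls osd --export > osd-specs.yaml
> > # add "unmanaged: true" at the top level of the osd.cost_capacity spec
> > ceph orch apply -i osd-specs.yaml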
> >
> > Thank you
> > -jeremy
> >
> > > On Sunday, Apr 13, 2025 at 11:48 PM, Eugen Block <ebl...@nde.ag
> > > (mailto:ebl...@nde.ag)> wrote:
> > > Are you using Rook? Usually, I see this warning when a host is not
> > > reachable, for example during a reboot. But it also clears when the
> > > host comes back. Do you see this permanently or from time to time? It
> > > might have to do with the different Ceph versions, I'm not sure. But
> > > it shouldn't be a show stopper for the remaining upgrade. Or are you
> > > trying to deploy OSDs but it fails? You can paste
> > >
> > > ceph health detail
> > > ceph orch ls osd --export
> > >
> > > You can also scan the cephadm.log for any hints.
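> > >
> > > For example, something like this on the affected host (adjust the
> > > pattern as needed):
> > >
> > > grep -iE 'error|traceback' /var/log/ceph/cephadm.log | tail -n 50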
> > >
> > >
> > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > >
> > > > This looks relevant.
> > > >
> > > > https://github.com/rook/rook/issues/13600#issuecomment-1905860331
> > > >
> > > > > On Sunday, Apr 13, 2025 at 10:08 AM, Jeremy Hansen
> > > > > <jer...@skidrow.la (mailto:jer...@skidrow.la)> wrote:
> > > > > I’m now seeing this:
> > > > >
> > > > > cluster:
> > > > >   id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
> > > > >   health: HEALTH_WARN
> > > > >           Failed to apply 1 service(s): osd.cost_capacity
> > > > >
> > > > >
> > > > > I’m assuming this is because I’ve only upgraded the mgrs so far,
> > > > > but I wanted to double-check before proceeding with the rest of
> > > > > the components.
> > > > >
> > > > > Thanks
> > > > > -jeremy
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > On Sunday, Apr 13, 2025 at 12:59 AM, Jeremy Hansen
> > > > > > <jer...@skidrow.la (mailto:jer...@skidrow.la)> wrote:
> > > > > > Updating the mgrs to 18.2.5 seemed to work just fine. I will go
> > > > > > for the remaining services after the weekend. Thanks.
> > > > > >
> > > > > > -jeremy
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Thursday, Apr 10, 2025 at 6:37 AM, Eugen Block
> > > > > > > <ebl...@nde.ag (mailto:ebl...@nde.ag)> wrote:
> > > > > > > Glad I could help! I'm also waiting for 18.2.5 to upgrade our own
> > > > > > > cluster from Pacific after getting rid of our cache tier. :-D
> > > > > > >
> > > > > > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > > > > > >
> > > > > > > > This seems to have worked to get the orch back up and put me
> > > > > > > > back to 16.2.15. Thank you. I’m debating whether to wait for
> > > > > > > > 18.2.5 before moving forward.
> > > > > > > >
> > > > > > > > -jeremy
> > > > > > > >
> > > > > > > > > On Monday, Apr 07, 2025 at 1:26 AM, Eugen Block <ebl...@nde.ag
> > > > > > > > > (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > Still no, just edit the unit.run file for the MGRs to use a
> > > > > > > > > different image. See Frédéric's instructions (now that I'm
> > > > > > > > > re-reading it, there's a little mistake with dots and hyphens):
> > > > > > > > >
> > > > > > > > > # Backup the unit.run file
> > > > > > > > > $ cp /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run{,.bak}
> > > > > > > > >
> > > > > > > > > # Change the container image's signature. You can get the
> > > > > > > > > # signature of the version you want to reach from
> > > > > > > > > # https://quay.io/repository/ceph/ceph?tab=tags. It's in the
> > > > > > > > > # URL of a version.
> > > > > > > > > $ sed -i 's/ceph@sha256:e40c19cd70e047d14d70f5ec3cf501da081395a670cd59ca881ff56119660c8f/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9/g' /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run
> > > > > > > > >
> > > > > > > > > # Restart the container (systemctl daemon-reload not needed)
> > > > > > > > > $ systemctl restart ceph-$(ceph fsid)@mgr.ceph01.eydqvm.service
> > > > > > > > >
> > > > > > > > > # Run this command a few times and it should show the new version
> > > > > > > > > ceph orch ps --refresh --hostname ceph01 | grep mgr
> > > > > > > > >
> > > > > > > > > To get the image signature, you can also look into the other
> > > > > > > > > unit.run files; a version tag would also work.
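> > > > > > > > >
> > > > > > > > > For example (a sketch with a placeholder digest, not taken
> > > > > > > > > from a real unit.run):
> > > > > > > > >
> > > > > > > > > sed -i 's#ceph@sha256:<old-digest>#ceph:v16.2.15#g' /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run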
> > > > > > > > >
> > > > > > > > > It depends on how often you need the orchestrator to
> > > > > > > > > maintain the cluster. If you have the time, you could wait a
> > > > > > > > > bit longer for other responses. If you need the orchestrator
> > > > > > > > > in the meantime, you can roll back the MGRs.
> > > > > > > > >
> > > > > > > > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/
> > > > > > > > >
> > > > > > > > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > > > > > > > >
> > > > > > > > > > Thank you. The only thing I’m unclear on is the rollback
> > > > > > > > > > to Pacific.
> > > > > > > > > >
> > > > > > > > > > Are you referring to
> > > > > > > > > >
> > > > > > > > > > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
> > > > > > > > > >
> > > > > > > > > > Thank you. I appreciate all the help. Should I wait for
> > > > > > > > > > Adam to comment? At the moment, the cluster is functioning
> > > > > > > > > > well enough to keep VMs running, so if it’s wise to wait, I
> > > > > > > > > > can do that.
> > > > > > > > > >
> > > > > > > > > > -jeremy
> > > > > > > > > >
> > > > > > > > > > > On Monday, Apr 07, 2025 at 12:23 AM, Eugen Block
> > > > > > > > > > > <ebl...@nde.ag (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > > > I haven't tried it this way yet, and I had hoped that
> > > > > > > > > > > Adam would chime in, but my approach would be to remove
> > > > > > > > > > > this key (it's not present when no upgrade is in progress):
> > > > > > > > > > >
> > > > > > > > > > > ceph config-key rm mgr/cephadm/upgrade_state
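> > > > > > > > > > >
> > > > > > > > > > > Stashing a copy of the key first can't hurt (just a
> > > > > > > > > > > precaution, not strictly required):
> > > > > > > > > > >
> > > > > > > > > > > ceph config-key get mgr/cephadm/upgrade_state > upgrade_state.json.bak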
> > > > > > > > > > >
> > > > > > > > > > > Then roll back the two newer MGRs to Pacific as described
> > > > > > > > > > > before. If they come up healthy, test if the orchestrator
> > > > > > > > > > > works properly first. For example, remove a node-exporter
> > > > > > > > > > > or crash daemon or anything else uncritical and let it
> > > > > > > > > > > redeploy. If that works, try a staggered upgrade, starting
> > > > > > > > > > > with the MGRs only:
> > > > > > > > > > >
> > > > > > > > > > > ceph orch upgrade start --image <image-name> --daemon-types mgr
> > > > > > > > > > >
> > > > > > > > > > > Since there's no need to go to Quincy, I suggest upgrading
> > > > > > > > > > > to Reef 18.2.4 (or waiting until 18.2.5 is released, which
> > > > > > > > > > > should be very soon), so set the respective <image-name>
> > > > > > > > > > > in the above command.
> > > > > > > > > > >
> > > > > > > > > > > If all three MGRs successfully upgrade, you can continue
> > > > > > > > > > > with the MONs, or with the entire rest.
> > > > > > > > > > >
> > > > > > > > > > > In production clusters, I usually do staggered upgrades,
> > > > > > > > > > > e.g. I limit the number of OSD daemons first just to see
> > > > > > > > > > > if they come up healthy, then I let it upgrade all other
> > > > > > > > > > > OSDs automatically.
> > > > > > > > > > >
> > > > > > > > > > > https://docs.ceph.com/en/latest/cephadm/upgrade/#staggered-upgrade
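> > > > > > > > > > >
> > > > > > > > > > > A limited first OSD pass could look like this (a sketch,
> > > > > > > > > > > assuming the v18.2.4 image):
> > > > > > > > > > >
> > > > > > > > > > > ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4 --daemon-types osd --limit 2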
> > > > > > > > > > >
> > > > > > > > > > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > > > > > > > > > >
> > > > > > > > > > > > Snipped some of the irrelevant logs to keep message size down.
> > > > > > > > > > > >
> > > > > > > > > > > > ceph config-key get mgr/cephadm/upgrade_state
> > > > > > > > > > > >
> > > > > > > > > > > > {"target_name": "quay.io/ceph/ceph:v17.2.0",
> > > > > > > > > > > > "progress_id": "e7e1a809-558d-43a7-842a-c6229fdc57af",
> > > > > > > > > > > > "target_id": "e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9",
> > > > > > > > > > > > "target_digests": ["quay.io/ceph/ceph@sha256:12a0a4f43413fd97a14a3d47a3451b2d2df50020835bb93db666209f3f77617a",
> > > > > > > > > > > > "quay.io/ceph/ceph@sha256:cb4d698cb769b6aba05bf6ef04f41a7fe694160140347576e13bd9348514b667"],
> > > > > > > > > > > > "target_version": "17.2.0", "fs_original_max_mds": null,
> > > > > > > > > > > > "fs_original_allow_standby_replay": null, "error": null,
> > > > > > > > > > > > "paused": false, "daemon_types": null, "hosts": null,
> > > > > > > > > > > > "services": null, "total_count": null, "remaining_count": null}
> > > > > > > > > > > >
> > > > > > > > > > > > What should I do next?
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you!
> > > > > > > > > > > > -jeremy
> > > > > > > > > > > >
> > > > > > > > > > > > > On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block
> > > > > > > > > > > > > <ebl...@nde.ag (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > > > > > Can you check if you have this config-key?
> > > > > > > > > > > > >
> > > > > > > > > > > > > ceph config-key get mgr/cephadm/upgrade_state
> > > > > > > > > > > > >
> > > > > > > > > > > > > If you reset the MGRs, it might be necessary to clear
> > > > > > > > > > > > > this key, otherwise you might end up in some
> > > > > > > > > > > > > inconsistency. Just to be sure.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks. I’m trying to be extra careful since this
> > > > > > > > > > > > > > cluster is actually in use. I’ll wait for your feedback.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -jeremy
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Saturday, Apr 05, 2025 at 3:39 PM, Eugen Block
> > > > > > > > > > > > > > > <ebl...@nde.ag (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > > > > > > > No, that's not necessary, just edit the unit.run
> > > > > > > > > > > > > > > file for the MGRs to use a different image. See
> > > > > > > > > > > > > > > Frédéric's instructions:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > But I'm not entirely sure if you need to clear some
> > > > > > > > > > > > > > > config-keys first in order to reset the upgrade
> > > > > > > > > > > > > > > state. If I have time, I'll try to check tomorrow,
> > > > > > > > > > > > > > > or on Monday.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Quoting Jeremy Hansen <jer...@skidrow.la>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Would I follow this process to downgrade?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Saturday, Apr 05, 2025 at 2:04 PM, Jeremy Hansen
> > > > > > > > > > > > > > > > > <jer...@skidrow.la (mailto:jer...@skidrow.la)> wrote:
> > > > > > > > > > > > > > > > > ceph -s claims things are healthy:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ceph -s
> > > > > > > > > > > > > > > > >   cluster:
> > > > > > > > > > > > > > > > >     id:     95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
> > > > > > > > > > > > > > > > >     health: HEALTH_OK
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   services:
> > > > > > > > > > > > > > > > >     mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
> > > > > > > > > > > > > > > > >     mgr: cn03.negzvb(active, since 26m), standbys: cn01.tjmtph, cn02.ceph.xyz.corp.ggixgj
> > > > > > > > > > > > > > > > >     mds: 1/1 daemons up, 2 standby
> > > > > > > > > > > > > > > > >     osd: 15 osds: 15 up (since 19h), 15 in (since 14M)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   data:
> > > > > > > > > > > > > > > > >     volumes: 1/1 healthy
> > > > > > > > > > > > > > > > >     pools:   6 pools, 610 pgs
> > > > > > > > > > > > > > > > >     objects: 284.59k objects, 1.1 TiB
> > > > > > > > > > > > > > > > >     usage:   3.3 TiB used, 106 TiB / 109 TiB avail
> > > > > > > > > > > > > > > > >     pgs:     610 active+clean
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   io:
> > > > > > > > > > > > > > > > >     client: 255 B/s rd, 1.2 MiB/s wr, 10 op/s rd, 16 op/s wr
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > —
> > > > > > > > > > > > > > > > > How do I downgrade if the orch is down?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you
> > > > > > > > > > > > > > > > > -jeremy
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Saturday, Apr 05, 2025 at 1:56 PM, Eugen Block
> > > > > > > > > > > > > > > > > > <ebl...@nde.ag (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > > > > > > > > > > It would help if you only pasted the relevant
> > > > > > > > > > > > > > > > > > parts. Anyway, these two sections stand out:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ---snip---
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.909+0000 7f26f0200700 0 [balancer INFO root] Some PGs (1.000000) are unknown; try again later
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Failed to construct class in 'cephadm'
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.917+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/module.py", line 470, in __init__
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade = CephadmUpgrade(self)
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112, in __init__
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.upgrade_state: Optional[UpgradeState] = UpgradeState.from_json(json.loads(t))
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93, in from_json
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: return cls(**c)
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: TypeError: __init__() got an unexpected keyword argument 'daemon_types'
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:48.918+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('cephadm')
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Failed to construct class in 'snap_schedule'
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.273+0000 7f2663400700 -1 mgr load Traceback (most recent call last):
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/module.py", line 38, in __init__
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: self.client = SnapSchedClient(self)
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 158, in __init__
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: with self.get_schedule_db(fs_name) as conn_mgr:
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: File "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line 192, in get_schedule_db
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: db.executescript(dump)
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: sqlite3.OperationalError: table schedules already exists
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]: debug 2025-04-05T20:33:49.274+0000 7f2663400700 -1 mgr operator() Failed to run module in active mode ('snap_schedule')
> > > > > > > > > > > > > > > > > > ---snip---
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Your cluster seems to be in an error state (ceph
> > > > > > > > > > > > > > > > > > -s) because of an unknown PG. It's recommended to
> > > > > > > > > > > > > > > > > > have a healthy cluster before attempting an
> > > > > > > > > > > > > > > > > > upgrade. It's possible that these errors come from
> > > > > > > > > > > > > > > > > > the not-yet-upgraded MGR, I'm not sure.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Since the upgrade was only successful for two
> > > > > > > > > > > > > > > > > > MGRs, I am thinking about downgrading both MGRs
> > > > > > > > > > > > > > > > > > back to 16.2.15, then retrying an upgrade to a
> > > > > > > > > > > > > > > > > > newer version, either 17.2.8 or 18.2.4. I haven't
> > > > > > > > > > > > > > > > > > checked the snap_schedule error yet, though. Maybe
> > > > > > > > > > > > > > > > > > someone else knows that already.
