Awesome, glad to hear it worked!
Regarding your question about whether you should upgrade further: there's no simple "yes" or "no" here. Do you need features or bug fixes from Squid that are missing in Reef? Reef is still supported, but it was announced just yesterday that it will reach EOL in August:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AYCCOUUEHTKPAH3DVXYYXGO5WAFKNPKR/

I will upgrade our own cluster to Reef today and wait until its EOL before upgrading further. I'm always a bit hesitant about the "latest and greatest", though, so take that as my (conservative) opinion.
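
If you do decide to go to Squid at some point, I would again do it staggered, MGRs first. Just as a sketch (the v19 image tag below is only an example, take whatever release is current on quay.io):

ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.2 --daemon-types mgr
ceph orch upgrade status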

Zitat von Jeremy Hansen <jer...@skidrow.la>:

Just to follow this through: 18.2.6 fixed my issues and I was able to complete the upgrade. Is it advisable to go to 19, or should I stay on Reef?

-jeremy

On Monday, Apr 14, 2025 at 12:14 AM, Jeremy Hansen <jer...@skidrow.la (mailto:jer...@skidrow.la)> wrote:

Thanks. I’ll wait. I need this to go smoothly on another cluster that has to go through the same process.

-jeremy



> On Monday, Apr 14, 2025 at 12:10 AM, Eugen Block <ebl...@nde.ag (mailto:ebl...@nde.ag)> wrote:
> Ah, this looks like the encryption issue which seems new in 18.2.5,
> brought up here:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UJ4DREAWNBBVVUJXYVZO25AYVQ5RLT42/
>
> In that case it's questionable whether you really want to upgrade to
> 18.2.5. Maybe 18.2.4 would be more suitable, although it's missing the bug
> fixes from .5 (like the fix for the RGW memory leak). If you really need to
> upgrade, I guess I would go with .4; otherwise stay on Pacific until
> this issue has been addressed. It's not an easy decision. ;-)
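>
> If you go with .4, you can pin the upgrade to that release, e.g. something
> like this (just a sketch):
>
> ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.4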
>
> Zitat von Jeremy Hansen <jer...@skidrow.la>:
>
> > I haven’t attempted the remaining upgrade just yet. I wanted to
> > check on this before proceeding. Things seem “stable” in the sense
> > that I’m running VMs and all volumes and images are still
> > functioning. I’m using whatever would have been the default from
> > 16.2.14. It seems to happen from time to time: I receive Nagios
> > alerts, which eventually clear and then reappear.
> >
> > HEALTH_WARN Failed to apply 1 service(s): osd.cost_capacity
> > [WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s):
> > osd.cost_capacity
> > osd.cost_capacity: cephadm exited with an error code: 1,
> > stderr:Inferring config
> > /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/mon.cn02/config
> > Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host
> > --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
> > --privileged --group-add=disk --init -e
> > CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c -e NODE_NAME=cn02.ceph.xyz.corp -e CEPH_VOLUME_OSDSPEC_AFFINITY=cost_capacity -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/run/ceph:z -v /var/log/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/log/ceph:z -v /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v /tmp/ceph-tmp49jj8zoh:/etc/ceph/ceph.conf:z -v /tmp/ceph-tmp_9k8v5uj:/var/lib/ceph/bootstrap-osd/ceph.keyring:z quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf --dmcrypt --yes
> > --no-systemd
> > /usr/bin/podman: stderr Traceback (most recent call last):
> > /usr/bin/podman: stderr File "/usr/sbin/ceph-volume", line 33, in <module>
> > /usr/bin/podman: stderr
> > sys.exit(load_entry_point('ceph-volume==1.0.0', 'console_scripts',
> > 'ceph-volume')())
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 54, in
> > __init__
> > /usr/bin/podman: stderr self.main(self.argv)
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/decorators.py", line
> > 59, in newfunc
> > /usr/bin/podman: stderr return f(*a, **kw)
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/main.py", line 166, in
> > main
> > /usr/bin/podman: stderr terminal.dispatch(self.mapper, subcommand_args)
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line
> > 194, in dispatch
> > /usr/bin/podman: stderr instance.main()
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/main.py",
> > line 46, in main
> > /usr/bin/podman: stderr terminal.dispatch(self.mapper, self.argv)
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/terminal.py", line
> > 192, in dispatch
> > /usr/bin/podman: stderr instance = mapper.get(arg)(argv[count:])
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/devices/lvm/batch.py",
> > line 325, in __init__
> > /usr/bin/podman: stderr self.args = parser.parse_args(argv)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py",
> > line 1825, in parse_args
> > /usr/bin/podman: stderr args, argv = self.parse_known_args(args, namespace)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py",
> > line 1858, in parse_known_args
> > /usr/bin/podman: stderr namespace, args =
> > self._parse_known_args(args, namespace)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py",
> > line 2067, in _parse_known_args
> > /usr/bin/podman: stderr start_index = consume_optional(start_index)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py",
> > line 2007, in consume_optional
> > /usr/bin/podman: stderr take_action(action, args, option_string)
> > /usr/bin/podman: stderr File "/usr/lib64/python3.9/argparse.py",
> > line 1935, in take_action
> > /usr/bin/podman: stderr action(self, namespace, argument_values,
> > option_string)
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/util/arg_validators.py", line
> > 17, in __call__
> > /usr/bin/podman: stderr set_dmcrypt_no_workqueue()
> > /usr/bin/podman: stderr File
> > "/usr/lib/python3.9/site-packages/ceph_volume/util/encryption.py",
> > line 54, in set_dmcrypt_no_workqueue
> > /usr/bin/podman: stderr raise RuntimeError('Error while checking
> > cryptsetup version.\n',
> > /usr/bin/podman: stderr RuntimeError: ('Error while checking
> > cryptsetup version.\n', '`cryptsetup --version` output:\n',
> > 'cryptsetup 2.7.2 flags: UDEV BLKID KEYRING FIPS KERNEL_CAPI
> > PWQUALITY ')
> > Traceback (most recent call last):
> > File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
> > return _run_code(code, main_globals, None,
> > File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
> > exec(code, run_globals)
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 11009, in <module>
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 10997, in main
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2593, in
> > _infer_config
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2509, in _infer_fsid > > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2621, in _infer_image
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2496, in
> > _validate_fsid
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 7226, in
> > command_ceph_volume
> > File "/tmp/tmpedb1_faj.cephadm.build/__main__.py", line 2284, in call_throws
> > RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host
> > --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
> > --privileged --group-add=disk --init -e
> > CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c -e NODE_NAME=cn02.ceph.xyz.corp -e CEPH_VOLUME_OSDSPEC_AFFINITY=cost_capacity -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/run/ceph:z -v /var/log/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1:/var/log/ceph:z -v /var/lib/ceph/95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v /tmp/ceph-tmp49jj8zoh:/etc/ceph/ceph.conf:z -v /tmp/ceph-tmp_9k8v5uj:/var/lib/ceph/bootstrap-osd/ceph.keyring:z quay.io/ceph/ceph@sha256:47de8754d1f72fadb61523247c897fdf673f9a9689503c64ca8384472d232c5c lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf --dmcrypt --yes
> > --no-systemd
> >
> > —
> >
> > ceph orch ls osd --export
> > service_type: osd
> > service_id: all-available-devices
> > service_name: osd.all-available-devices
> > placement:
> >   host_pattern: '*'
> > spec:
> >   data_devices:
> >     all: true
> >   filter_logic: AND
> >   objectstore: bluestore
> > ---
> > service_type: osd
> > service_id: cost_capacity
> > service_name: osd.cost_capacity
> > placement:
> >   host_pattern: '*'
> > spec:
> >   data_devices:
> >     rotational: 1
> >   encrypted: true
> >   filter_logic: AND
> >   objectstore: bluestore
> >
> > Thank you
> > -jeremy
> >
> > > On Sunday, Apr 13, 2025 at 11:48 PM, Eugen Block <ebl...@nde.ag
> > > (mailto:ebl...@nde.ag)> wrote:
> > > Are you using Rook? Usually, I see this warning when a host is not
> > > reachable, for example during a reboot. But it also clears when the
> > > host comes back. Do you see this permanently or from time to time? It
> > > might have to do with the different Ceph versions, I'm not sure. But
> > > it shouldn't be a show stopper for the remaining upgrade. Or are you
> > > trying to deploy OSDs but it fails? You can paste
> > >
> > > ceph health detail
> > > ceph orch ls osd --export
> > >
> > > You can also scan the cephadm.log for any hints.
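> > >
> > > For example, something like this on the affected host (just a sketch; with
> > > cephadm the host log is usually /var/log/ceph/cephadm.log):
> > >
> > > grep -iE 'error|fail' /var/log/ceph/cephadm.log | tail -n 20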
> > >
> > >
> > > Zitat von Jeremy Hansen <jer...@skidrow.la>:
> > >
> > > > This looks relevant.
> > > >
> > > > https://github.com/rook/rook/issues/13600#issuecomment-1905860331
> > > >
> > > > > On Sunday, Apr 13, 2025 at 10:08 AM, Jeremy Hansen
> > > > > <jer...@skidrow.la (mailto:jer...@skidrow.la)> wrote:
> > > > > I’m now seeing this:
> > > > >
> > > > > cluster:
> > > > > id: 95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
> > > > > health: HEALTH_WARN
> > > > > Failed to apply 1 service(s): osd.cost_capacity
> > > > >
> > > > >
> > > > > I’m assuming this is due to the fact that I’ve only upgraded mgr
> > > > > but I wanted to double check before proceeding with the rest of the
> > > > > components.
> > > > >
> > > > > Thanks
> > > > > -jeremy
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > On Sunday, Apr 13, 2025 at 12:59 AM, Jeremy Hansen
> > > > > <jer...@skidrow.la (mailto:jer...@skidrow.la)> wrote:
> > > > > > Updating mgr’s to 18.2.5 seemed to work just fine. I will go for
> > > > > the remaining services after the weekend. Thanks.
> > > > > >
> > > > > > -jeremy
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Thursday, Apr 10, 2025 at 6:37 AM, Eugen Block
> > > > > <ebl...@nde.ag (mailto:ebl...@nde.ag)> wrote:
> > > > > > > Glad I could help! I'm also waiting for 18.2.5 to upgrade our own
> > > > > > > cluster from Pacific after getting rid of our cache tier. :-D
> > > > > > >
> > > > > > > Zitat von Jeremy Hansen <jer...@skidrow.la>:
> > > > > > >
> > > > > > > > This seems to have worked to get the orch back up and put
> > > me back to
> > > > > > > > 16.2.15. Thank you. Debating on waiting for 18.2.5 to
> > > move forward.
> > > > > > > >
> > > > > > > > -jeremy
> > > > > > > >
> > > > > > > > > On Monday, Apr 07, 2025 at 1:26 AM, Eugen Block <ebl...@nde.ag
> > > > > > > > > (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > Still no, just edit the unit.run file for the MGRs to use a
> > > > > different
> > > > > > > > > image. See Frédéric's instructions (now that I'm re-reading it,
> > > > > > > > > there's a little mistake with dots and hyphens):
> > > > > > > > >
> > > > > > > > > # Backup the unit.run file
> > > > > > > > > $ cp /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run{,.bak}
> > > > > > > > >
> > > > > > > > > # Change container image's signature. You can get the signature of
> > > > > > > > > # the version you want to reach from
> > > > > > > > > # https://quay.io/repository/ceph/ceph?tab=tags. It's in the URL of a version.
> > > > > > > > > $ sed -i 's/ceph@sha256:e40c19cd70e047d14d70f5ec3cf501da081395a670cd59ca881ff56119660c8f/ceph@sha256:d26c11e20773704382946e34f0d3d2c0b8bb0b7b37d9017faa9dc11a0196c7d9/g' /var/lib/ceph/$(ceph fsid)/mgr.ceph01.eydqvm/unit.run
> > > > > > > > >
> > > > > > > > > # Restart the container (systemctl daemon-reload not needed)
> > > > > > > > > $ systemctl restart ceph-$(ceph fsid)@mgr.ceph01.eydqvm.service
> > > > > > > > >
> > > > > > > > > # Run this command a few times and it should show the new version
> > > > > > > > > ceph orch ps --refresh --hostname ceph01 | grep mgr
> > > > > > > > >
> > > > > > > > > To get the image signature, you can also look into the other unit.run
> > > > > > > > > files; a version tag would also work.
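> > > > > > > > >
> > > > > > > > > For example, something like this lists the signatures currently in use
> > > > > > > > > (just a sketch, the glob depends on your daemon directories):
> > > > > > > > >
> > > > > > > > > grep -h -o 'ceph@sha256:[0-9a-f]*' /var/lib/ceph/$(ceph fsid)/mgr.*/unit.run | sort -u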
> > > > > > > > >
> > > > > > > > > It depends on how often you need the orchestrator to
> > > maintain the
> > > > > > > > > cluster. If you have the time, you could wait a bit
> > > longer for other
> > > > > > > > > responses. If you need the orchestrator in the meantime,
> > > > > you can roll
> > > > > > > > > back the MGRs.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/
> > > > > > > > >
> > > > > > > > > Zitat von Jeremy Hansen <jer...@skidrow.la>:
> > > > > > > > >
> > > > > > > > > > Thank you. The only thing I’m unclear on is the rollback
> > > > > to pacific.
> > > > > > > > > >
> > > > > > > > > > Are you referring to
> > > > > > > > > >
> > > > > > > > > > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
> > > > > > > > > >
> > > > > > > > > > Thank you. I appreciate all the help. Should I wait for Adam to
> > > > > > > > > > comment? At the moment, the cluster is functioning enough to
> > > > > > > > > > maintain running vms, so if it's wise to wait, I can do that.
> > > > > > > > > >
> > > > > > > > > > -jeremy
> > > > > > > > > >
> > > > > > > > > > > On Monday, Apr 07, 2025 at 12:23 AM, Eugen Block
> > > <ebl...@nde.ag
> > > > > > > > > > > (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > > > I haven't tried it this way yet, and I had hoped that
> > > > > Adam would chime
> > > > > > > > > > > in, but my approach would be to remove this key (it's
> > > > > not present when
> > > > > > > > > > > no upgrade is in progress):
> > > > > > > > > > >
> > > > > > > > > > > ceph config-key rm mgr/cephadm/upgrade_state
> > > > > > > > > > >
> > > > > > > > > > > Then rollback the two newer MGRs to Pacific as
> > > > > described before. If
> > > > > > > > > > > they come up healthy, test if the orchestrator works
> > > > > properly first.
> > > > > > > > > > > For example, remove a node-exporter or crash or anything else
> > > > > > > > > > > uncritical and let it redeploy.
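> > > > > > > > > > >
> > > > > > > > > > > Something like this, for example (the daemon name is only an example,
> > > > > > > > > > > check ceph orch ps for a real one):
> > > > > > > > > > >
> > > > > > > > > > > ceph orch daemon rm node-exporter.cn01
> > > > > > > > > > > ceph orch ps --daemon-type node-exporter --refresh
> > > > > > > > > > >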
> > > > > > > > > > > If that works, try a staggered upgrade, starting with
> > > > > the MGRs only:
> > > > > > > > > > >
> > > > > > > > > > > ceph orch upgrade start --image <image-name> --daemon-types mgr
> > > > > > > > > > >
> > > > > > > > > > > Since there's no need to go to Quincy, I suggest to
> > > > > upgrade to Reef
> > > > > > > > > > > 18.2.4 (or you wait until 18.2.5 is released, which
> > > > > should be very
> > > > > > > > > > > soon), so set the respective <image-name> in the
> > > above command.
> > > > > > > > > > >
> > > > > > > > > > > If all three MGRs successfully upgrade, you can
> > > > > continue with the
> > > > > > > > > > > MONs, or with the entire rest.
> > > > > > > > > > >
> > > > > > > > > > > In production clusters, I usually do staggered
> > > > > upgrades, e. g. I limit
> > > > > > > > > > > the number of OSD daemons first just to see if they
> > > > > come up healthy,
> > > > > > > > > > > then I let it upgrade all other OSDs automatically.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > https://docs.ceph.com/en/latest/cephadm/upgrade/#staggered-upgrade
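> > > > > > > > > > >
> > > > > > > > > > > Limiting the OSDs could look something like this (just an example):
> > > > > > > > > > >
> > > > > > > > > > > ceph orch upgrade start --image <image-name> --daemon-types osd --limit 3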
> > > > > > > > > > >
> > > > > > > > > > > Zitat von Jeremy Hansen <jer...@skidrow.la>:
> > > > > > > > > > >
> > > > > > > > > > > > Snipped some of the irrelevant logs to keep
> > > message size down.
> > > > > > > > > > > >
> > > > > > > > > > > > ceph config-key get mgr/cephadm/upgrade_state
> > > > > > > > > > > >
> > > > > > > > > > > > {"target_name": "quay.io/ceph/ceph:v17.2.0",
> > > "progress_id":
> > > > > > > > > > > > "e7e1a809-558d-43a7-842a-c6229fdc57af", "target_id":
> > > > > > > > > > > >
> > > > > "e1d6a67b021eb077ee22bf650f1a9fb1980a2cf5c36bdb9cba9eac6de8f702d9",
> > > > > > > > > > > > "target_digests":
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > >
> > > ["quay.io/ceph/ceph@sha256:12a0a4f43413fd97a14a3d47a3451b2d2df50020835bb93db666209f3f77617a", "quay.io/ceph/ceph@sha256:cb4d698cb769b6aba05bf6ef04f41a7fe694160140347576e13bd9348514b667"], "target_version": "17.2.0", "fs_original_max_mds": null, "fs_original_allow_standby_replay": null, "error": null, "paused": false, "daemon_types": null, "hosts": null, "services":
> > > null,
> > > > > "total_count":
> > > > > > > > > null,
> > > > > > > > > > > "remaining_count":
> > > > > > > > > > > > null}
> > > > > > > > > > > >
> > > > > > > > > > > > What should I do next?
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you!
> > > > > > > > > > > > -jeremy
> > > > > > > > > > > >
> > > > > > > > > > > > > On Sunday, Apr 06, 2025 at 1:38 AM, Eugen Block
> > > > > <ebl...@nde.ag
> > > > > > > > > > > > > (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > > > > > Can you check if you have this config-key?
> > > > > > > > > > > > >
> > > > > > > > > > > > > ceph config-key get mgr/cephadm/upgrade_state
> > > > > > > > > > > > >
> > > > > > > > > > > > > If you reset the MGRs, it might be necessary to
> > > > > clear this key,
> > > > > > > > > > > > > otherwise you might end up in some inconsistency.
> > > > > Just to be sure.
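> > > > > > > > > > > > >
> > > > > > > > > > > > > (Clearing it would then be: ceph config-key rm mgr/cephadm/upgrade_state)
> > > > > > > > > > > > >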
> > > > > > > > > > > > >
> > > > > > > > > > > > > Zitat von Jeremy Hansen <jer...@skidrow.la>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks. I’m trying to be extra careful since this
> > > > > cluster is
> > > > > > > > > > > > > > actually in use. I’ll wait for your feedback.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -jeremy
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Saturday, Apr 05, 2025 at 3:39 PM, Eugen
> > > > > Block <ebl...@nde.ag
> > > > > > > > > > > > > > > (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > > > > > > > No, that's not necessary, just edit the
> > > > > unit.run file for
> > > > > > > > > > > the MGRs to
> > > > > > > > > > > > > > > use a different image. See Frédéric's instructions:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/32APKOXKRAIZ7IDCNI25KVYFCCCF6RJG/
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > But I'm not entirely sure if you need to clear some
> > > > > > > > > > > config-keys first
> > > > > > > > > > > > > > > in order to reset the upgrade state. If I have
> > > > > time, I'll
> > > > > > > > > > > try to check
> > > > > > > > > > > > > > > tomorrow, or on Monday.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Zitat von Jeremy Hansen <jer...@skidrow.la>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Would I follow this process to downgrade?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-manager-daemon
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Saturday, Apr 05, 2025 at 2:04 PM,
> > > Jeremy Hansen
> > > > > > > > > > > > > > > > > <jer...@skidrow.la
> > > > > (mailto:jer...@skidrow.la)> wrote:
> > > > > > > > > > > > > > > > > ceph -s claims things are healthy:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ceph -s
> > > > > > > > > > > > > > > > > cluster:
> > > > > > > > > > > > > > > > > id: 95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1
> > > > > > > > > > > > > > > > > health: HEALTH_OK
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > services:
> > > > > > > > > > > > > > > > > mon: 3 daemons, quorum cn01,cn03,cn02 (age 20h)
> > > > > > > > > > > > > > > > > mgr: cn03.negzvb(active, since 26m),
> > > > > standbys: cn01.tjmtph,
> > > > > > > > > > > > > > > > > cn02.ceph.xyz.corp.ggixgj
> > > > > > > > > > > > > > > > > mds: 1/1 daemons up, 2 standby
> > > > > > > > > > > > > > > > > osd: 15 osds: 15 up (since 19h), 15 in
> > > (since 14M)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > data:
> > > > > > > > > > > > > > > > > volumes: 1/1 healthy
> > > > > > > > > > > > > > > > > pools: 6 pools, 610 pgs
> > > > > > > > > > > > > > > > > objects: 284.59k objects, 1.1 TiB
> > > > > > > > > > > > > > > > > usage: 3.3 TiB used, 106 TiB / 109 TiB avail
> > > > > > > > > > > > > > > > > pgs: 610 active+clean
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > io:
> > > > > > > > > > > > > > > > > client: 255 B/s rd, 1.2 MiB/s wr, 10 op/s
> > > > > rd, 16 op/s wr
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > —
> > > > > > > > > > > > > > > > > How do I downgrade if the orch is down?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you
> > > > > > > > > > > > > > > > > -jeremy
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Saturday, Apr 05, 2025 at 1:56 PM,
> > > Eugen Block
> > > > > > > > > > > <ebl...@nde.ag
> > > > > > > > > > > > > > > > > (mailto:ebl...@nde.ag)> wrote:
> > > > > > > > > > > > > > > > > > It would help if you only pasted the
> > > > > relevant parts.
> > > > > > > > > > > > > Anyway, these two
> > > > > > > > > > > > > > > > > > sections stand out:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > ---snip---
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > debug 2025-04-05T20:33:48.909+0000
> > > 7f26f0200700 0
> > > > > > > > > > > > > [balancer INFO root]
> > > > > > > > > > > > > > > > > > Some PGs (1.000000) are unknown; try
> > > again later
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > debug 2025-04-05T20:33:48.917+0000
> > > > > 7f2663400700 -1 mgr
> > > > > > > > > > > > > load Failed to
> > > > > > > > > > > > > > > > > > construct class in 'cephadm'
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > debug 2025-04-05T20:33:48.917+0000
> > > > > 7f2663400700 -1 mgr
> > > > > > > > > > > > > load Traceback
> > > > > > > > > > > > > > > > > > (most recent call last):
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > File
> > > > > "/usr/share/ceph/mgr/cephadm/module.py", line 470,
> > > > > > > > > > > > > in __init__
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > self.upgrade = CephadmUpgrade(self)
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > File
> > > > > "/usr/share/ceph/mgr/cephadm/upgrade.py", line 112,
> > > > > > > > > > > > > in __init__
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > self.upgrade_state: Optional[UpgradeState] =
> > > > > > > > > > > > > > > > > > UpgradeState.from_json(json.loads(t))
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > File
> > > > > "/usr/share/ceph/mgr/cephadm/upgrade.py", line 93,
> > > > > > > > > > > > > in from_json
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > return cls(**c)
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > TypeError: __init__() got an unexpected
> > > > > keyword argument
> > > > > > > > > > > > > > > 'daemon_types'
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > Apr 05 20:33:48 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > debug 2025-04-05T20:33:48.918+0000
> > > 7f2663400700 -1
> > > > > > > > > > > mgr operator()
> > > > > > > > > > > > > > > > > > Failed to run module in active mode
> > > ('cephadm')
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > debug 2025-04-05T20:33:49.273+0000
> > > > > 7f2663400700 -1 mgr
> > > > > > > > > > > > > load Failed to
> > > > > > > > > > > > > > > > > > construct class in 'snap_schedule'
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > debug 2025-04-05T20:33:49.273+0000
> > > > > 7f2663400700 -1 mgr
> > > > > > > > > > > > > load Traceback
> > > > > > > > > > > > > > > > > > (most recent call last):
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > File
> > > > > > > > > "/usr/share/ceph/mgr/snap_schedule/module.py", line 38,
> > > > > > > > > > > > > > > in __init__
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > self.client = SnapSchedClient(self)
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > File
> > > > > > > > > > > > >
> > > > > "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line
> > > > > > > > > > > > > > > > > > 158, in __init__
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > with self.get_schedule_db(fs_name) as
> > > conn_mgr:
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > File
> > > > > > > > > > > > >
> > > > > "/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py", line
> > > > > > > > > > > > > > > > > > 192, in get_schedule_db
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > db.executescript(dump)
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > sqlite3.OperationalError: table schedules
> > > > > already exists
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > Apr 05 20:33:49 cn03.ceph.xyz.corp
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > ceph-95f49c1c-b1e8-11ee-b5d0-0cc47a8f35c1-mgr-cn03-negzvb[307291]:
> > > > > > > > > > > > > > > > > > debug 2025-04-05T20:33:49.274+0000
> > > 7f2663400700 -1
> > > > > > > > > > > mgr operator()
> > > > > > > > > > > > > > > > > > Failed to run module in active mode
> > > > > ('snap_schedule')
> > > > > > > > > > > > > > > > > > ---snip---
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Your cluster seems to be in an error
> > > > > state (ceph -s)
> > > > > > > > > > > because of an
> > > > > > > > > > > > > > > > > > unknown PG. It's recommended to have a healthy
> > > > > > > > > cluster before
> > > > > > > > > > > > > > > > > > attemping an upgrade. It's possible that
> > > > > these errors
> > > > > > > > > > > > > come from the
> > > > > > > > > > > > > > > > > > not upgraded MGR, I'm not sure.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Since the upgrade was only successful for
> > > > > two MGRs, I
> > > > > > > > > > > am thinking
> > > > > > > > > > > > > > > > > > about downgrading both MGRs back to
> > > > > 16.2.15, then retry
> > > > > > > > > > > > > an upgrade to
> > > > > > > > > > > > > > > > > > a newer version, either 17.2.8 or 18.2.4.
> > > > > I haven't
> > > > > > > > > > > checked the
> > > > > > > > > > > > > > > > > > snap_schedule error yet, though. Maybe
> > > > > someone else knows
> > > > > > > > > > > > > > > that already.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > >
> > >
> > >
>
>
>


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
