[ceph-users] Re: Cephadm: unable to copy ceph.conf.new
Hi, I commented a similar issue a couple of months ago: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IQX2VXA6QQQPEZQ7GU3QY2WPHAIVPIUN/ Can you check if that applies to your cluster?

Zitat von Magnus Larsen : Hi Ceph-users!

Ceph version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Using cephadm to orchestrate the Ceph cluster

I'm running into https://tracker.ceph.com/issues/59189, which is fixed in the next version—quincy 17.2.7—via https://github.com/ceph/ceph/pull/50906

But I am unable to upgrade to the fixed version because of that bug.

When I try to upgrade (using "ceph orch upgrade start --image internal_mirror/ceph:v17.2.7"), we see the same error message:

executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028', 'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033', 'dkcphhpcosd034', 'dkcphhpcosd035', 'dkcphhpcosd036', 'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039', 'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042', 'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    conn = await self._remote_connection(host, addr)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/etc/ceph/ceph.conf.new: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in _write_client_files
    self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 261, in write_remote_file
    self.mgr.wait_async(self._write_remote_file(
  File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 249, in _write_remote_file
    logger.exception(msg)
orchestrator._interface.OrchestratorError: Unable to write dkcphhpcmgt028:/etc/ceph/ceph.conf: scp: /tmp/etc/ceph/ceph.conf.new: Permission denied

We were thinking about removing the keyring from the Ceph orchestrator (https://docs.ceph.com/en/latest/cephadm/operations/#putting-a-keyring-under-management), which would then make Ceph not try to copy over a new ceph.conf, alleviating the problem (https://docs.ceph.com/en/latest/cephadm/operations/#client-keyrings-and-configs), but in doing so, Ceph will kindly remove the key from all nodes (https://docs.ceph.com/en/latest/cephadm/operations/#disabling-management-of-a-keyring-file), leaving us without the admin keyring. So that doesn't sound like a path we want to take :S

Does anybody know how to get around this issue, so I can get to the version where the issue is fixed for good?

Thanks, Magnus
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
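If it turns out to be the same thing as the issue linked above (assumption: a stale, root-owned temp directory under /tmp that the non-root cephadm SSH user cannot write into), a quick check on the failing host would look something like this:

ssh ceph@dkcphhpcmgt028 'ls -ld /tmp/etc /tmp/etc/ceph 2>/dev/null'
# if these exist and are owned by root, clearing them out lets the next
# _write_files attempt recreate them as the ceph user (double-check that
# nothing else on the host uses /tmp/etc before deleting):
ssh ceph@dkcphhpcmgt028 'sudo rm -rf /tmp/etc'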
[ceph-users] Re: Pull failed on cluster upgrade
Unfortunately I'm on bare metal, with very old hardware so I cannot do much. I'd try to build a Ceph image based on Rocky Linux 8 if I could get the Dockerfile of the current image to start with, but I've not been able to find it. Can you please help me with this? Cheers, Nicola ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
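In the meantime, a self-built image is doable if el8 packages exist for the release you need on download.ceph.com (worth checking first). A rough, untested sketch (the repo path, the package set and the need for EPEL/PowerTools are assumptions, and the result is not equivalent to the official images, which are built by the ceph/ceph-container project):

cat > ceph.repo <<'EOF'
[ceph]
name=Ceph x86_64
baseurl=https://download.ceph.com/rpm-reef/el8/x86_64
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc

[ceph-noarch]
name=Ceph noarch
baseurl=https://download.ceph.com/rpm-reef/el8/noarch
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc
EOF

cat > Dockerfile <<'EOF'
FROM rockylinux:8
COPY ceph.repo /etc/yum.repos.d/ceph.repo
RUN dnf install -y dnf-plugins-core epel-release && \
    dnf config-manager --set-enabled powertools && \
    dnf install -y ceph ceph-radosgw && \
    dnf clean all
EOF

podman build -t my-registry/ceph:reef-el8 .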
[ceph-users] Re: Pull failed on cluster upgrade
Hi, > On 7 Aug 2024, at 10:31, Nicola Mori wrote: > > Unfortunately I'm on bare metal, with very old hardware so I cannot do much. > I'd try to build a Ceph image based on Rocky Linux 8 if I could get the > Dockerfile of the current image to start with, but I've not been able to find > it. Can you please help me with this? You can try your luck with packages, if I understood the problem correctly [1]. Apparently this is a problem of our time: the developer now writes software for the container, and if the hardware underneath isn't to the container's liking, the answer is "well, buy new hardware". k [1] https://github.com/ceph/ceph-build/pull/2272 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [EXTERN] Re: Pull failed on cluster upgrade
On 8/7/24 09:40, Konstantin Shalygin wrote: Hi, On 7 Aug 2024, at 10:31, Nicola Mori wrote: Unfortunately I'm on bare metal, with very old hardware so I cannot do much. I'd try to build a Ceph image based on Rocky Linux 8 if I could get the Dockerfile of the current image to start with, but I've not been able to find it. Can you please help me with this? You can try your luck with packages, if I understood the problem correctly [1]. Apparently this is a problem of our time: the developer now writes software for the container, and if the hardware underneath isn't to the container's liking, the answer is "well, buy new hardware". It would be very helpful for Ceph admins if the upgrade routines first checked whether the upgrade is supported by the underlying hardware: $ ceph orch upgrade start --ceph-version should fail in case of unsupported hardware. Just an idea. Dietmar ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
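Until something like that exists, a pre-flight check can be scripted from the admin node. A sketch (the sse4_2 flag is only a rough stand-in for the x86-64-v2 baseline that the newer el9-based images target, and jq is assumed to be available):

for host in $(ceph orch host ls --format json | jq -r '.[].hostname'); do
  ssh "$host" "grep -qm1 sse4_2 /proc/cpuinfo && echo $host: ok || echo $host: CPU lacks x86-64-v2 features"
done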
[ceph-users] Re: Cephadm: unable to copy ceph.conf.new
Hi, please don't drop the ML from your response. Is this the first upgrade you're attempting or did previous upgrades work with the current config?

> I wonder if I can generate a new ssh configuration for the root user,
> and then use that to upgrade to the fixed version.
> The permissions will then be owned by root, which means we can't use
> the ceph user, no?

I do remember having an issue with non-root user on a customer cluster, but IIRC it was because of insufficient sudo permissions. In the end, they switched to root user, and there haven't been any issues since, at least nobody reported anything to me. Do you mind sharing your sudo config for the ceph user? Thanks, Eugen

Zitat von Magnus Larsen : Hi, We do have a client keyring with the label:

# ceph orch client-keyring ls
ENTITY        PLACEMENT     MODE   OWNER  PATH
client.admin  label:_admin  rw---  0:0    /etc/ceph/ceph.client.admin.keyring

And the SSH config is also correct (verified just now) - though we use ceph as the user, not the default root, which works normally, except that we can't upgrade until we get the fix in... which is in the next upgrade :< I wonder if I can generate a new ssh configuration for the root user, and then use that to upgrade to the fixed version. The permissions will then be owned by root, which means we can't use the ceph user, no? ref: https://docs.ceph.com/en/octopus/cephadm/operations/#ssh-configuration Thanks! Magnus Larsen

From: Eugen Block Sent: 7 August 2024 09:15 To: ceph-users@ceph.io Subject: [ceph-users] Re: Cephadm: unable to copy ceph.conf.new Hi, I commented a similar issue a couple of months ago: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IQX2VXA6QQQPEZQ7GU3QY2WPHAIVPIUN/ Can you check if that applies to your cluster? [...]
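For reference, the knobs cephadm exposes for the user/SSH switch being discussed (a sketch: after set-user, the public key has to be authorized for root on every host before the mgr can reconnect):

ceph cephadm get-ssh-config   # ssh_config currently used by the orchestrator
ceph cephadm get-pub-key      # key to add to root's authorized_keys on all hosts
ceph cephadm set-user root    # switch the orchestrator to connect as root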
[ceph-users] Re: Cephadm: unable to copy ceph.conf.new
Hi, Sorry! Fixed. The configuration is as follows:

root@management-node1 # cat /etc/sudoers.d/ceph
ceph ALL=(ALL) NOPASSWD: ALL

So.. no restrictions :^)

From: Eugen Block Sent: 7 August 2024 10:38 To: Magnus Larsen Cc: ceph-users@ceph.io Subject: Re: Sv: [ceph-users] Re: Cephadm: unable to copy ceph.conf.new Hi, please don't drop the ML from your response. Is this the first upgrade you're attempting or did previous upgrades work with the current config? > I wonder if I can generate a new ssh configuration for the root user, > and then use that to upgrade to the fixed version. > The permissions will then be owned by root, which means we can't use > the ceph user, no? I do remember having an issue with non-root user on a customer cluster, but IIRC it was because of insufficient sudo permissions. In the end, they switched to root user, and there haven't been any issues since, at least nobody reported anything to me. Do you mind sharing your sudo config for the ceph user? Thanks, Eugen Zitat von Magnus Larsen : > Hi, > > We do have a client keyring with the label: [...]
[ceph-users] Re: Cephadm: unable to copy ceph.conf.new
And are any of the hosts shown as offline in the 'ceph orch host ls' output? Is this the first upgrade you're attempting or did previous upgrades work with the current config?

Zitat von Magnus Larsen : Hi, Sorry! Fixed. The configuration is as follows: root@management-node1 # cat /etc/sudoers.d/ceph ceph ALL=(ALL) NOPASSWD: ALL So.. no restrictions :^) [...]
[ceph-users] Re: Pull failed on cluster upgrade
Thank you Konstantin; as was foreseeable, this problem didn't hit just me. So I hope the build of images based on CentOS Stream 8 will be resumed. Otherwise I'll try to build one myself. Nicola ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Multi-Site sync error with multipart objects: Resource deadlock avoided
Hi, We've been trying to set up multi-site sync on two test VMs before rolling things out on actual production hardware. Both are running Ceph 18.2.4 deployed via cephadm. Host OS is Debian 12, container runtime is podman (switched from Debian 11 and docker.io, same error there). There is only one RGW daemon on each site. Ceph config is pretty much defaults. One thing I did change was setting rgw_relaxed_region_enforcement to true because the zonegroup got renamed from "default" during the switch to multi-site using the dashboard's assistant. There's nothing special like server-side encryption either. Our end goal is to replicate all RGW data from our current cluster to a new one. The Multi-Site configuration itself went pretty smoothly through the dashboard and pre-existing data started syncing right away. Unfortunately, not all objects made it. To be precise, none of the larger objects over the multipart threshold got synced. This is consistent for newly uploaded multipart objects as well. Curiously, it's working fine in the other direction, i.e. multipart uploads from the secondary zone do get synced to the master. Here are some relevant logs: >From `radosgw-admin sync error list`: { "shard_id": 26, "entries": [ { "id": "1_1722598249.479766_23730.1", "section": "data", "name": "foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7/logstash_1%3a8.12.2-1_amd64.deb", "timestamp": "2024-08-02T11:30:49.479766Z", "info": { "source_zone": "5160b406-4428-4fdc-9c5d-5ec9fe9404c0", "error_code": 35, "message": "failed to sync object(35) Resource deadlock avoided" } } ] }, >From RGW on the receiving end: Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.474+ 7f3a6243e640 0 rgw async rados processor: store->fetch_remote_obj() returned r=-35 Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.474+ 7f3a36b7b640 2 req 7168648379339657593 0.0s :list_data_changes_log normalizing buckets and tenants Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.474+ 7f3a36b7b640 2 req 7168648379339657593 0.003999872s :list_data_changes_log init permissions Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.478+ 7f3a36b7b640 2 req 7168648379339657593 0.003999872s :list_data_changes_log recalculating target Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.478+ 7f3a36b7b640 2 req 7168648379339657593 0.003999872s :list_data_changes_log reading permissions Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.478+ 7f3a36b7b640 2 req 7168648379339657593 0.003999872s :list_data_changes_log init op Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.478+ 7f3a36b7b640 2 req 7168648379339657593 0.003999872s :list_data_changes_log verifying op mask Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.478+ 7f3a36b7b640 2 req 7168648379339657593 0.003999872s :list_data_changes_log verifying op permissions Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.478+ 7f3a36b7b640 2 overriding permissions due to system operation Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.478+ 7f3a36b7b640 2 req 7168648379339657593 0.003999872s :list_data_changes_log verifying op params Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 2024-08-02T11:30:49.478+ 7f3a5241e640 0 
RGW-SYNC:data:sync:shard[28]:entry[foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7[0]]:bucket_sync_sources[source=foobar:new[5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3]):7:source_zone=5160b406-4428-4fdc-9c5d-5ec9fe9404c0]:bucket[foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3<-foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7]:inc_sync[foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7]:entry[logstash_1%3a8.12.2-1_amd64.deb]: ERROR: failed to sync object: foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7/logstash_1%3a8.12.2-1_amd64.deb And from the sender: Aug 02 13:30:49 test-ceph-single bash[885118]: debug 2024-08-02T11:30:49.476+ 7f0acfdb2640 1 == req done req=0x7f0ab50e4710 op status=-104 http_status=200 latency=0.419986606s == Aug 02 13:30:49 test-ceph-single bash[885118]: debug 2024-08-02T11:30:49.476+ 7f0ba9f66640 2 req 5943847843579143466 0.0s initializing for trans_id = tx0527cca1f3381a52a-0066acc369-c052e6-eu2 Aug 02 13:30:49 test-ceph-single bash[885118]: debug 2024-08-02T11:30:49.476+ 7f0acfdb2640 1 beast: 0x7f0ab50e4710: 10.139.0.151 - synchronization-user [02/Aug/2024:11:30:49.056 +] "GET
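In case it helps to narrow this down, a few per-bucket checks on the receiving zone, using the bucket from the error above (the --source-zone value is the zone name, not the UUID; whether a manual `bucket sync run` is appropriate here is a judgment call, the assumption being that it only re-processes that bucket's sync):

radosgw-admin sync error list
radosgw-admin bucket sync status --bucket foobar/new
radosgw-admin bucket sync run --bucket foobar/new --source-zone <master-zone-name>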
[ceph-users] Re: Cephadm: unable to copy ceph.conf.new
It might be worth trying to manually upgrade one of the mgr daemons. Go to the host with a mgr and edit the /var/lib/ceph///unit.run so that the image specified in the long podman/docker run command in there is the 17.2.7 image. Then just restart its systemd unit (don't tell the orchestrator to do the restart of the mgr. That can cause your change to the unit.run file to be overwritten). If you only have two mgr daemons you should be able to use failovers to make that one the active mgr, at which point the active mgr will have the patch that fixes this issue and you should be able to get the upgrade going. `ceph orch daemon redeploy --image <17.2.7 image>` might also work, but I tend to find the manual steps are more reliable for this sort of issue as you don't have to worry about issues within the orchestrator causing that operation to fail. On Tue, Aug 6, 2024 at 7:26 PM Magnus Larsen wrote: > Hi Ceph-users! > > Ceph version: ceph version 17.2.6 > (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable) > Using cephadm to orchestrate the Ceph cluster > > I'm running into https://tracker.ceph.com/issues/59189, which is fixed in > next version—quincy 17.2.7—via https://github.com/ceph/ceph/pull/50906 > > But I am unable to upgrade to the fixed version because of that bug [...]
[ceph-users] mds damaged with preallocated inodes that are inconsistent with inotable
Hi, Experts, we are running a CephFS on v16.2.* with multiple active MDS daemons. Currently we are hitting "fs cephfs mds.* is damaged", and this MDS always complains "client *** loaded with preallocated inodes that are inconsistent with inotable" and always suicides during replay. Could anyone please help here? We really need you to shed some light! Thanks a lot! xz ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Any way to put the rate limit on rbd flatten operation?
Hello, AFAIK, massive rx/tx occurs on the client side for the flatten operation. So I want to control the network rate limit, or at least predict the network bandwidth it will consume. Is there any way to do that? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
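There is no per-image bandwidth cap for flatten that I know of, but two partial answers. For prediction: flatten reads the parent data that the clone still references and writes it into the clone once, so client-side rx plus tx is roughly twice that amount of data. For throttling: lowering the number of in-flight maintenance operations slows the copy down (rbd_concurrent_management_ops is a real librbd setting; passing it on the command line as shown is an assumption worth verifying):

rbd flatten mypool/myclone --rbd-concurrent-management-ops 2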
[ceph-users] Re: Can you return orphaned objects to a bucket?
Hi, You're right. The object reindex subcommand backport was rejected for Pacific and is still pending for Quincy and Reef. [1] Use the rgw-restore-bucket-index script instead. Regards, Frédéric. [1] https://tracker.ceph.com/issues/61405 From: vuphun...@gmail.com Sent: Wednesday, 7 August 2024 01:38 To: ceph-users@ceph.io Subject: [ceph-users] Re: Can you return orphaned objects to a bucket? Hi, Currently I see it only supports the latest version; is there any way to support old versions like Pacific or Quincy? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
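The script ships with newer releases and can also be taken from the matching branch of the Ceph source tree (src/rgw/). Its flags have changed between versions, so treat the invocation below as an assumption and check the built-in help first:

rgw-restore-bucket-index --help
rgw-restore-bucket-index -b <bucket-name>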
[ceph-users] Re: RGW sync gets stuck every day
Hi, Redeploying stuff seems like a much too big hammer to get things going again. Surely there must be something more reasonable? wouldn't a restart suffice? Do you see anything in the 'radosgw-admin sync error list'? Maybe an error prevents the sync from continuing? Zitat von Olaf Seibert : Hi all, we have some Ceph clusters with RGW replication between them. It seems that in the last month at least, it gets stuck at around the same time ~every day. Not 100% the same time, and also not 100% of the days, but in the more recent days seem to happen more, and for longer. With "stuck" I mean that the "oldest incremental change not applied" is getting 5 or more minutes old, and not changing. In the past this seemed to resolve itself in a short time, but recently it didn't. It remained stuck at the same place for several hours. Also, on several different occasions I noticed that the shard number in question was the same. We are using Ceph 18.2.2, image id 719d4c40e096. The output on one end looks like this (I redacted out some of the data because I don't know how much of the naming would be sensitive information): root@zone2:/# radosgw-admin sync status --rgw-realm backup realm ----8ddf4576ebab (backup) zonegroup ----58af9051e063 (backup) zone ----e1223ae425a4 (zone2-backup) current time 2024-08-04T10:22:00Z zonegroup features enabled: resharding disabled: compress-encrypted metadata sync no sync (zone is master) data sync source: ----e8db1c51b705 (zone1-backup) syncing full sync: 0/128 shards incremental sync: 128/128 shards data is behind on 3 shards behind shards: [30,90,95] oldest incremental change not applied: 2024-08-04T10:05:54.015403+ [30] while on the other side it looks ok (not more than half a minute behind): root@zone1:/# radosgw-admin sync status --rgw-realm backup realm ----8ddf4576ebab (backup) zonegroup ----58af9051e063 (backup) zone ----e8db1c51b705 (zone1-backup) current time 2024-08-04T10:23:05Z zonegroup features enabled: resharding disabled: compress-encrypted metadata sync syncing full sync: 0/64 shards incremental sync: 64/64 shards metadata is caught up with master data sync source: ----e1223ae425a4 (zone2-backup) syncing full sync: 0/128 shards incremental sync: 128/128 shards data is behind on 4 shards behind shards: [89,92,95,98] oldest incremental change not applied: 2024-08-04T10:22:53.175975+ [95] With some experimenting, we found that redeploying the RGWs on this side resolves the situation: "ceph orch redeploy rgw.zone1-backup". The shards go into "Recovering" state and after a short time it is "caught up with source" as well. Redeploying stuff seems like a much too big hammer to get things going again. Surely there must be something more reasonable? Also, any ideas about how we can find out what is causing this? It may be that some customer has some job running every 24 hours, but that shouldn't cause the replication to get stuck. 
Thanks in advance, -- Olaf Seibert Site Reliability Engineer SysEleven GmbH Boxhagener Straße 80 10245 Berlin T +49 30 233 2012 0 F +49 30 616 7555 0 https://www.syseleven.de https://www.linkedin.com/company/syseleven-gmbh/ Current system status always at: https://www.syseleven-status.net/ Company headquarters: Berlin Registered court: AG Berlin Charlottenburg, HRB 108571 Berlin Managing directors: Andreas Hermann, Jens Ihlenfeld, Norbert Müller, Jens Plogsties ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
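On the "smaller hammer" point: before a full redeploy, a plain restart of the RGW service plus the checks Eugen mentions is usually enough to see whether something is actually wedged (service, realm and bucket names below are from this thread or placeholders):

ceph orch restart rgw.zone1-backup
radosgw-admin sync error list --rgw-realm backup
radosgw-admin bucket sync status --bucket <some-busy-bucket> --rgw-realm backup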
[ceph-users] Re: Please guide us in identifying the cause of the data miss in EC pool
Hi Chulin, Are you 100% sure that 494, 1169 and 1057 (the ones that did not restart) were in the acting set at the exact moment the power outage occurred? I'm asking because min_size 6 would have allowed the data to be written to as few as 6 OSDs, possibly exactly the ones that crashed. Bests, Frédéric. From: Best Regards Sent: Thursday, 8 August 2024 08:10 To: Frédéric Nass Cc: ceph-users Subject: Re:Re: Re:Re: Re:Re: Re:Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool Hi, Frédéric Nass Thank you for your continued attention and guidance. Let's analyze and verify this issue from different perspectives. The reason why we did not stop the investigation is that we tried to find other ways to avoid the losses caused by this sudden failure. Turning off the disk cache is the last option; of course, this operation will only be carried out after finding definite evidence. I also have a question: among the 9 OSDs, some have not been restarted. In theory, these OSDs should retain the object info (metadata, pg_log, etc.), even if the object cannot be recovered. I went through the boot logs of the OSDs where the object should be located and the PG peering process: OSD 494/1169/1057 have been in the running state, and osd.494 was the primary of the acting_set during the failure. However, no record of the object was found using `ceph-objectstore-tool --op list` or `--op log`, so the loss of data due to disk cache loss does not seem to be a complete explanation (perhaps there is some processing logic that we have not paid attention to). Best Regards, Woo wu_chu...@qq.com Best Regards Original Email From: "Frédéric Nass" < frederic.n...@univ-lorraine.fr >; Sent Time: 2024/8/8 4:01 To: "wu_chulin" < wu_chu...@qq.com >; Subject: Re: Re:Re: Re:Re: Re:Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool Hey Chulin, Looks clearer now. Non-persistent cache for KV metadata and BlueStore metadata certainly explains how data was lost without the cluster even noticing. What's unexpected is data staying for so long in the disks' buffers and not being written to persistent sectors at all. Anyway, thank you for sharing your use case and investigation. It was nice chatting with you. If you can, share this on the ceph-users list. It will for sure benefit everyone in the community. Best regards, Frédéric. PS: Note that using min_size >= k + 1 on EC pools is recommended (just as min_size >= 2 is on replicated x3 pools) because you don't want to write data without any parity chunks. From: wu_chu...@qq.com Sent: Wednesday, 7 August 2024 11:30 To: Frédéric Nass Subject: Re:Re: Re:Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool Hi, Yes, after the file -> object -> PG -> OSD correspondence is found, the object record can be found on the specified OSD using the command `ceph-objectstore-tool --op list`. The pool min_size is 6. The business department reported more than 30 lost files, but we proactively screened out more than 100. The upload time of the lost files was mainly distributed about 3 hours before the failure, and these files were successfully downloaded after being uploaded (per the RGW log). One OSD corresponds to one disk, and no separate space is allocated for WAL/DB. The HDD cache is the default (enabled by default on SATA), and the hard disk cache has not been forcibly turned off because of performance concerns. The loss of OSD data due to the loss of hard disk cache was our initial inference, and the initial explanation provided to the business department was the same.
When the cluster was restored, ceph reported 12 unfound objects, which is acceptable. After all, most devices were powered off abnormally, and it is difficult to ensure the integrity of all data. Up to now, our team has not located how the data was lost. In the past, when the hard disk hardware was damaged, either the OSD could not start because of damaged key data, or some objects were read incorrectly after the OSD started, which could be repaired. Now deep-scrub cannot find the problem, which may be related to the loss (or deletion) of the object metadata. After all, deep-scrub needs the object list of the current PG; if those 9 OSDs do not have the object metadata information, deep-scrub does not know the object ever existed. wu_chu...@qq.com wu_chu...@qq.com Original Email From: "Frédéric Nass" < frederic.n...@univ-lorraine.fr >; Sent Time: 2024/8/6 20:40 To: "wu_chulin" < wu_chu...@qq.com >; Subject: Re: Re:Re: Re:Re: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool That's interesting. Have you tried to correlate any existing retrievable object to PG id and OSD mapping in order to verify the presence of each of these object's shards using ceph-objectstore-tool on each one of its acting OSDs, for a
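For anyone following along, the mapping check discussed above looks roughly like this (pool, object and OSD id are placeholders; on a containerized OSD the --data-path is reached via `cephadm shell`, and the trailing s0/s1/... in the PG id is the EC shard number):

ceph osd map <ec-data-pool> <rgw-object-name>   # prints the PG id and acting set
# with osd.494 stopped:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-494 --pgid <pgid>s0 --op list | grep <rgw-object-name>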
[ceph-users] Re: mds damaged with preallocated inodes that are inconsistent with inotable
On Thu, Aug 8, 2024 at 12:41 AM zxcs wrote: > > HI, Experts, > > we are running a cephfs with V16.2.*, and has multi active mds. Currently, we > are hitting a mds fs cephfs mds.* id damaged. and this mds always complain > > > “client *** loaded with preallocated inodes that are inconsistent with > inotable” > > > and the mds always suicide during replay. Could anyone please help here ? We > really need you shed some light! Could you share (debug) mds logs when it hits this during replay? > > > Thanks lot ! > > > xz > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io -- Cheers, Venky ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
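For reference, one way to capture that (daemon name is a placeholder; paths assume a default log setup):

ceph config set mds.<daemon-name> debug_mds 20
ceph config set mds.<daemon-name> debug_ms 1
# let the MDS go through the failing replay once, then collect the log:
# containerized: cephadm logs --name mds.<daemon-name>
# package install: /var/log/ceph/ceph-mds.<daemon-name>.log
ceph config rm mds.<daemon-name> debug_mds
ceph config rm mds.<daemon-name> debug_ms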