[ceph-users] Re: Cephadm: unable to copy ceph.conf.new

2024-08-07 Thread Eugen Block

Hi,

I commented a similar issue a couple of months ago:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IQX2VXA6QQQPEZQ7GU3QY2WPHAIVPIUN/

Can you check if that applies to your cluster?

Zitat von Magnus Larsen :


Hi Ceph-users!

Ceph version: ceph version 17.2.6  
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

Using cephadm to orchestrate the Ceph cluster

I’m running into https://tracker.ceph.com/issues/59189, which is  
fixed in the next version (quincy 17.2.7) via

https://github.com/ceph/ceph/pull/50906

But I am unable to upgrade to the fixed version because of that bug

When I try to upgrade (using “ceph orch upgrade start --image  
internal_mirror/ceph:v17.2.7”), we see the same error message:
executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028',
'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033', 'dkcphhpcosd034',
'dkcphhpcosd035', 'dkcphhpcosd036', 'dkcphhpcosd037', 'dkcphhpcosd038',
'dkcphhpcosd039', 'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042',
'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed.

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    conn = await self._remote_connection(host, addr)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/etc/ceph/ceph.conf.new: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in _write_client_files
    self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 261, in write_remote_file
    self.mgr.wait_async(self._write_remote_file(
  File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 249, in _write_remote_file
    logger.exception(msg)
orchestrator._interface.OrchestratorError: Unable to write dkcphhpcmgt028:/etc/ceph/ceph.conf: scp: /tmp/etc/ceph/ceph.conf.new: Permission denied


We were thinking about removing the keyring from the Ceph  
orchestrator  
(https://docs.ceph.com/en/latest/cephadm/operations/#putting-a-keyring-under-management),
which would then make Ceph not try to copy over a new ceph.conf,  
alleviating the problem  
(https://docs.ceph.com/en/latest/cephadm/operations/#client-keyrings-and-configs),
but in doing so, Ceph will kindly remove the key from all nodes  
(https://docs.ceph.com/en/latest/cephadm/operations/#disabling-management-of-a-keyring-file)
leaving us without the admin keyring. So that doesn’t sound like a  
path we want to take :S
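
For reference, the orchestrator commands behind those doc links look roughly like
this (entity and label as shown by `ceph orch client-keyring ls` later in the
thread; a sketch, not a recommendation):

# show what cephadm currently manages
ceph orch client-keyring ls
# stop managing the admin keyring -- per the docs this also removes the
# managed file from the hosts, which is exactly the side effect described above
ceph orch client-keyring rm client.admin
# put it back under management on all hosts carrying the _admin label
ceph orch client-keyring set client.admin label:_admin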


Does anybody know how to get around this issue, so I can get to a  
version where the issue is fixed for good?


Thanks,
Magnus
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: Pull failed on cluster upgrade

2024-08-07 Thread Nicola Mori
Unfortunately I'm on bare metal, with very old hardware so I cannot do 
much. I'd try to build a Ceph image based on Rocky Linux 8 if I could 
get the Dockerfile of the current image to start with, but I've not been 
able to find it. Can you please help me with this?

Cheers,

Nicola


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pull failed on cluster upgrade

2024-08-07 Thread Konstantin Shalygin
Hi,

> On 7 Aug 2024, at 10:31, Nicola Mori  wrote:
> 
> Unfortunately I'm on bare metal, with very old hardware so I cannot do much. 
> I'd try to build a Ceph image based on Rocky Linux 8 if I could get the 
> Dockerfile of the current image to start with, but I've not been able to find 
> it. Can you please help me with this?

You can try your luck with packages, if I understood the problem correctly [1].
Apparently this is a problem of our time: the developer now writes software for 
the container, and if the hardware underneath the container is something the 
container is not happy with, the answer is "well, buy new hardware".


k
[1] https://github.com/ceph/ceph-build/pull/2272
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERN] Re: Pull failed on cluster upgrade

2024-08-07 Thread Dietmar Rieder

On 8/7/24 09:40, Konstantin Shalygin wrote:

Hi,


On 7 Aug 2024, at 10:31, Nicola Mori  wrote:

Unfortunately I'm on bare metal, with very old hardware so I cannot do much. 
I'd try to build a Ceph image based on Rocky Linux 8 if I could get the 
Dockerfile of the current image to start with, but I've not been able to find 
it. Can you please help me with this?


You can try luck with packages, if I understood the problem correctly [1]
Apparently this is a problem of our time. Now the developer writes software for the 
container, and the fact that under the container there is hardware that the container is 
not happy with is "well, buy new hardware"


It would be very helpful for Ceph admins if the upgrade routines first 
checked whether an upgrade is supported by the underlying hardware:


$ ceph orch upgrade start --ceph-version 

should fail in case of unsupported hardware.
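
For what it's worth, a rough sketch of such a pre-flight check, assuming the
incompatibility is the x86-64-v2 CPU baseline required by newer container base
images (that assumption may not hold for every case):

# run on each host before starting the upgrade; a missing flag means the
# newer container image will likely die with "Illegal instruction"
for f in cx16 lahf_lm popcnt sse4_1 sse4_2 ssse3; do
    grep -qw "$f" /proc/cpuinfo || echo "missing CPU flag: $f"
done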

Just an idea

Dietmar



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm: unable to copy ceph.conf.new

2024-08-07 Thread Eugen Block

Hi,

please don't drop the ML from your response.

Is this the first upgrade you're attempting or did previous upgrades  
work with the current config?


I wonder if I can generate a new ssh configuration for the root user,  
and then use that to upgrade to the fixed version.
The permissions will then be owned by root, which means we can't use  
the ceph user, no?


I do remember having an issue with a non-root user on a customer  
cluster, but IIRC it was because of insufficient sudo permissions. In  
the end, they switched to the root user, and there haven't been any issues  
since; at least nobody has reported anything to me.
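
For completeness, switching cephadm over to the root SSH user would roughly be
the following (<host> is a placeholder, repeat the key distribution for every
host in the cluster):

ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@<host>
ceph cephadm set-user root
ceph orch host ls    # verify all hosts are still reachable afterwards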

Do you mind sharing your sudo config for the ceph user?

Thanks,
Eugen

Zitat von Magnus Larsen :


Hi,

We do have a client-keyring under management with the _admin label:
# ceph orch client-keyring ls
ENTITY        PLACEMENT     MODE   OWNER  PATH
client.admin  label:_admin  rw---  0:0    /etc/ceph/ceph.client.admin.keyring


And the SSH-config is also correct (verified just now) - though we  
use ceph as the user, not the default root,
which works normally, except that we can't upgrade until we get the  
fix in... which is in the next upgrade :<


I wonder if I can generate a new ssh configuration for the root user,  
and then use that to upgrade to the fixed version.
The permissions will then be owned by root, which means we can't use  
the ceph user, no?


ref: https://docs.ceph.com/en/octopus/cephadm/operations/#ssh-configuration

Thanks!
Magnus Larsen


Fra: Eugen Block 
Sendt: 7. august 2024 09:15
Til: ceph-users@ceph.io 
Emne: [ceph-users] Re: Cephadm: unable to copy ceph.conf.new

Hi,

I commented a similar issue a couple of months ago:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IQX2VXA6QQQPEZQ7GU3QY2WPHAIVPIUN/

Can you check if that applies to your cluster?

Zitat von Magnus Larsen :


Hi Ceph-users!

Ceph version: ceph version 17.2.6
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Using cephadm to orchestrate the Ceph cluster

I’m running into https://tracker.ceph.com/issues/59189, which is
fixed in next version—quincy 17.2.7—via
https://github.com/ceph/ceph/pull/50906

But I am unable to upgrade to the fixed version because of that bug

When I try to upgrade (using “ceph orch upgrade start –image
internal_mirror/ceph:v17.2.7”), we see the same error message:
executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028',
'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033',
'dkcphhpcosd034', 'dkcphhpcosd035', 'dkcphhpcosd036',
'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039',
'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042',
'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed. Traceback (most
recent call last): File "/usr/share/ceph/mgr/cephadm/ssh.py", line
240, in _write_remote_file conn = await
self._remote_connection(host, addr) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
await source.run(srcpath) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
self.handle_error(exc) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
handle_error raise exc from None File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
await self._send_files(path, b'') File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in
_send_files self.handle_error(exc) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
handle_error raise exc from None File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in
_send_files await self._send_file(srcpath, dstpath, attrs) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in
_send_file await self._make_cd_request(b'C', attrs, size, srcpath)
File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in
_make_cd_request self._fs.basename(path)) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in
make_request raise exc asyncssh.sftp.SFTPFailure: scp:
/tmp/etc/ceph/ceph.conf.new: Permission denied During handling of
the above exception, another exception occurred: Traceback (most
recent call last): File "/usr/share/ceph/mgr/cephadm/utils.py", line
79, in do_work return f(*arg) File
"/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
self._write_client_files(client_files, host) File
"/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in
_write_client_files self.mgr.ssh.write_remote_file(host, path,
content, mode, uid, gid) File "/usr/share/ceph/mgr/cephadm/ssh.py",
line 261, in write_remote_file
self.mgr.wait_async(self._write_remote_file( File
"/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
return self.event_loop.get_result(coro) File
"/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result return
asyncio.run_coroutine_threadsafe(coro, self._loop).result() File
"/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result() File
"/lib64/python3.6/concurrent/futures/_base.py", line 384, in
__get_result raise self._exception Fil

[ceph-users] Re: Cephadm: unable to copy ceph.conf.new

2024-08-07 Thread Magnus Larsen
Hi,

Sorry! fixed.

The configuration is as follows:
root@management-node1 # cat /etc/sudoers.d/ceph
ceph ALL=(ALL)   NOPASSWD: ALL

So.. no restrictions :^)

From: Eugen Block 
Sent: 7 August 2024 10:38
To: Magnus Larsen 
Cc: ceph-users@ceph.io 
Subject: Re: Sv: [ceph-users] Re: Cephadm: unable to copy ceph.conf.new

Hi,

please don't drop the ML from your response.

Is this the first upgrade you're attempting or did previous upgrades
work with the current config?

> I wonder if can generate a new ssh configuration for the root user,
> and then use that to upgrade to the fixed version.
> The permissions will then be owned by root, which means we can't use
> the ceph user, no?

I do remember having an issue with non-root user on a customer
cluster, but IIRC it was because of insufficient sudo permissions. In
the end, they switched to root user, and there haven't been any issues
since, at least nobody reported anything to me.
Do you mind sharing your sudo config for the ceph user?

Thanks,
Eugen

Zitat von Magnus Larsen :

> Hi,
>
> We do have client-keyring with the label:
> # ceph orch client-keyring ls
> ENTITYPLACEMENT MODE   OWNER  PATH
> client.admin  label:_admin  rw---  0:0
> /etc/ceph/ceph.client.admin.keyring
>
> And the SSH-config is also correct (verified just now) - though we
> use ceph as the user, not the default root,
> which works normally, except that we can't upgrade until we get the
> fix in... which is in the next upgrade :<
>
> I wonder if can generate a new ssh configuration for the root user,
> and then use that to upgrade to the fixed version.
> The permissions will then be owned by root, which means we can't use
> the ceph user, no?
>
> ref: https://docs.ceph.com/en/octopus/cephadm/operations/#ssh-configuration
>
> Thanks!
> Magnus Larsen
>
> 
> Fra: Eugen Block 
> Sendt: 7. august 2024 09:15
> Til: ceph-users@ceph.io 
> Emne: [ceph-users] Re: Cephadm: unable to copy ceph.conf.new
>
> Hi,
>
> I commented a similar issue a couple of months ago:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IQX2VXA6QQQPEZQ7GU3QY2WPHAIVPIUN/
>
> Can you check if that applies to your cluster?
>
> Zitat von Magnus Larsen :
>
>> Hi Ceph-users!
>>
>> Ceph version: ceph version 17.2.6
>> (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>> Using cephadm to orchestrate the Ceph cluster
>>
>> I’m running into https://tracker.ceph.com/issues/59189, which is
>> fixed in next version—quincy 17.2.7—via
>> https://github.com/ceph/ceph/pull/50906
>>
>> But I am unable to upgrade to the fixed version because of that bug
>>
>> When I try to upgrade (using “ceph orch upgrade start –image
>> internal_mirror/ceph:v17.2.7”), we see the same error message:
>> executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028',
>> 'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033',
>> 'dkcphhpcosd034', 'dkcphhpcosd035', 'dkcphhpcosd036',
>> 'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039',
>> 'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042',
>> 'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed. Traceback (most
>> recent call last): File "/usr/share/ceph/mgr/cephadm/ssh.py", line
>> 240, in _write_remote_file conn = await
>> self._remote_connection(host, addr) File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
>> await source.run(srcpath) File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
>> self.handle_error(exc) File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
>> handle_error raise exc from None File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
>> await self._send_files(path, b'') File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in
>> _send_files self.handle_error(exc) File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
>> handle_error raise exc from None File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in
>> _send_files await self._send_file(srcpath, dstpath, attrs) File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in
>> _send_file await self._make_cd_request(b'C', attrs, size, srcpath)
>> File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in
>> _make_cd_request self._fs.basename(path)) File
>> "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in
>> make_request raise exc asyncssh.sftp.SFTPFailure: scp:
>> /tmp/etc/ceph/ceph.conf.new: Permission denied During handling of
>> the above exception, another exception occurred: Traceback (most
>> recent call last): File "/usr/share/ceph/mgr/cephadm/utils.py", line
>> 79, in do_work return f(*arg) File
>> "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
>> self._write_client_files(client_files, host) File
>> "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in
>> _write_client_files self.mgr.ssh.write_remote_file(host, path,
>> content, mode, uid, gid) File "/usr/share/cep

[ceph-users] Re: Cephadm: unable to copy ceph.conf.new

2024-08-07 Thread Eugen Block

And are any of the hosts shown as offline in the 'ceph orch host ls' output?
Is this the first upgrade you're attempting or did previous upgrades  
work with the current config?



Zitat von Magnus Larsen :


Hi,

Sorry! fixed.

The configuration is a follows:
root@management-node1 # cat /etc/sudoers.d/ceph
ceph ALL=(ALL)   NOPASSWD: ALL

So.. no restrictions :^)

Fra: Eugen Block 
Sendt: 7. august 2024 10:38
Til: Magnus Larsen 
Cc: ceph-users@ceph.io 
Emne: Re: Sv: [ceph-users] Re: Cephadm: unable to copy ceph.conf.new

Hi,

please don't drop the ML from your response.

Is this the first upgrade you're attempting or did previous upgrades
work with the current config?


I wonder if can generate a new ssh configuration for the root user,
and then use that to upgrade to the fixed version.
The permissions will then be owned by root, which means we can't use
the ceph user, no?


I do remember having an issue with non-root user on a customer
cluster, but IIRC it was because of insufficient sudo permissions. In
the end, they switched to root user, and there haven't been any issues
since, at least nobody reported anything to me.
Do you mind sharing your sudo config for the ceph user?

Thanks,
Eugen

Zitat von Magnus Larsen :


Hi,

We do have client-keyring with the label:
# ceph orch client-keyring ls
ENTITYPLACEMENT MODE   OWNER  PATH
client.admin  label:_admin  rw---  0:0
/etc/ceph/ceph.client.admin.keyring

And the SSH-config is also correct (verified just now) - though we
use ceph as the user, not the default root,
which works normally, except that we can't upgrade until we get the
fix in... which is in the next upgrade :<

I wonder if can generate a new ssh configuration for the root user,
and then use that to upgrade to the fixed version.
The permissions will then be owned by root, which means we can't use
the ceph user, no?

ref: https://docs.ceph.com/en/octopus/cephadm/operations/#ssh-configuration

Thanks!
Magnus Larsen


Fra: Eugen Block 
Sendt: 7. august 2024 09:15
Til: ceph-users@ceph.io 
Emne: [ceph-users] Re: Cephadm: unable to copy ceph.conf.new

Hi,

I commented a similar issue a couple of months ago:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IQX2VXA6QQQPEZQ7GU3QY2WPHAIVPIUN/

Can you check if that applies to your cluster?

Zitat von Magnus Larsen :


Hi Ceph-users!

Ceph version: ceph version 17.2.6
(d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Using cephadm to orchestrate the Ceph cluster

I’m running into https://tracker.ceph.com/issues/59189, which is
fixed in next version—quincy 17.2.7—via
https://github.com/ceph/ceph/pull/50906

But I am unable to upgrade to the fixed version because of that bug

When I try to upgrade (using “ceph orch upgrade start –image
internal_mirror/ceph:v17.2.7”), we see the same error message:
executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028',
'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033',
'dkcphhpcosd034', 'dkcphhpcosd035', 'dkcphhpcosd036',
'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039',
'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042',
'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed. Traceback (most
recent call last): File "/usr/share/ceph/mgr/cephadm/ssh.py", line
240, in _write_remote_file conn = await
self._remote_connection(host, addr) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
await source.run(srcpath) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
self.handle_error(exc) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
handle_error raise exc from None File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
await self._send_files(path, b'') File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in
_send_files self.handle_error(exc) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
handle_error raise exc from None File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in
_send_files await self._send_file(srcpath, dstpath, attrs) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in
_send_file await self._make_cd_request(b'C', attrs, size, srcpath)
File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in
_make_cd_request self._fs.basename(path)) File
"/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in
make_request raise exc asyncssh.sftp.SFTPFailure: scp:
/tmp/etc/ceph/ceph.conf.new: Permission denied During handling of
the above exception, another exception occurred: Traceback (most
recent call last): File "/usr/share/ceph/mgr/cephadm/utils.py", line
79, in do_work return f(*arg) File
"/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
self._write_client_files(client_files, host) File
"/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in
_write_client_files self.mgr.ssh.write_remote_file(host, path,
content, mode, uid, gid) File "/usr/share/ceph/mgr/cephad

[ceph-users] Re: Pull failed on cluster upgrade

2024-08-07 Thread Nicola Mori
Thank you Konstantin; as was foreseeable, this problem didn't hit just 
me. So I hope the build of images based on CentOS Stream 8 will be 
resumed. Otherwise I'll try to build one myself.


Nicola


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Multi-Site sync error with multipart objects: Resource deadlock avoided

2024-08-07 Thread Tino Lehnig
Hi,

We've been trying to set up multi-site sync on two test VMs before rolling 
things out on actual production hardware. Both are running Ceph 18.2.4 deployed 
via cephadm. Host OS is Debian 12, container runtime is podman (switched from 
Debian 11 and docker.io, same error there). There is only one RGW daemon on 
each site. Ceph config is pretty much defaults. One thing I did change was 
setting rgw_relaxed_region_enforcement to true because the zonegroup got 
renamed from "default" during the switch to multi-site using the dashboard's 
assistant. There's nothing special like server-side encryption either. Our end 
goal is to replicate all RGW data from our current cluster to a new one.

The Multi-Site configuration itself went pretty smoothly through the dashboard 
and pre-existing data started syncing right away. Unfortunately, not all 
objects made it. To be precise, none of the larger objects over the multipart 
threshold got synced. This is consistent for newly uploaded multipart objects 
as well. Curiously, it's working fine in the other direction, i.e. multipart 
uploads from the secondary zone do get synced to the master.
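
For reference, the per-bucket sync state on the receiving zone can be inspected
with something like the following (<bucket> and <key> are placeholders):

radosgw-admin sync error list
radosgw-admin bucket sync status --bucket=<bucket>
radosgw-admin object stat --bucket=<bucket> --object=<key>   # run on both zones to compare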

Here are some relevant logs:

From `radosgw-admin sync error list`:

{
"shard_id": 26,
"entries": [
{
"id": "1_1722598249.479766_23730.1",
"section": "data",
"name": 
"foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7/logstash_1%3a8.12.2-1_amd64.deb",
"timestamp": "2024-08-02T11:30:49.479766Z",
"info": {
"source_zone": "5160b406-4428-4fdc-9c5d-5ec9fe9404c0",
"error_code": 35,
"message": "failed to sync object(35) Resource deadlock 
avoided"
}
}
]
},



From RGW on the receiving end:

Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.474+ 7f3a6243e640  0 rgw async rados processor: 
store->fetch_remote_obj() returned r=-35
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.474+ 7f3a36b7b640  2 req 7168648379339657593 
0.0s :list_data_changes_log normalizing buckets and tenants
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.474+ 7f3a36b7b640  2 req 7168648379339657593 
0.003999872s :list_data_changes_log init permissions
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.478+ 7f3a36b7b640  2 req 7168648379339657593 
0.003999872s :list_data_changes_log recalculating target
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.478+ 7f3a36b7b640  2 req 7168648379339657593 
0.003999872s :list_data_changes_log reading permissions
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.478+ 7f3a36b7b640  2 req 7168648379339657593 
0.003999872s :list_data_changes_log init op
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.478+ 7f3a36b7b640  2 req 7168648379339657593 
0.003999872s :list_data_changes_log verifying op mask
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.478+ 7f3a36b7b640  2 req 7168648379339657593 
0.003999872s :list_data_changes_log verifying op permissions
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.478+ 7f3a36b7b640  2 overriding permissions due to 
system operation
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.478+ 7f3a36b7b640  2 req 7168648379339657593 
0.003999872s :list_data_changes_log verifying op params
Aug 02 13:30:49 dev-ceph-single bash[754387]: debug 
2024-08-02T11:30:49.478+ 7f3a5241e640  0 
RGW-SYNC:data:sync:shard[28]:entry[foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7[0]]:bucket_sync_sources[source=foobar:new[5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3]):7:source_zone=5160b406-4428-4fdc-9c5d-5ec9fe9404c0]:bucket[foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3<-foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7]:inc_sync[foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7]:entry[logstash_1%3a8.12.2-1_amd64.deb]:
 ERROR: failed to sync object: 
foobar/new:5160b406-4428-4fdc-9c5d-5ec9fe9404c0.12564119.3:7/logstash_1%3a8.12.2-1_amd64.deb



And from the sender:

Aug 02 13:30:49 test-ceph-single bash[885118]: debug 
2024-08-02T11:30:49.476+ 7f0acfdb2640  1 == req done req=0x7f0ab50e4710 
op status=-104 http_status=200 latency=0.419986606s ==
Aug 02 13:30:49 test-ceph-single bash[885118]: debug 
2024-08-02T11:30:49.476+ 7f0ba9f66640  2 req 5943847843579143466 
0.0s initializing for trans_id = 
tx0527cca1f3381a52a-0066acc369-c052e6-eu2
Aug 02 13:30:49 test-ceph-single bash[885118]: debug 
2024-08-02T11:30:49.476+ 7f0acfdb2640  1 beast: 0x7f0ab50e4710: 
10.139.0.151 - synchronization-user [02/Aug/2024:11:30:49.056 +] "GET

[ceph-users] Re: Cephadm: unable to copy ceph.conf.new

2024-08-07 Thread Adam King
It might be worth trying to manually upgrade one of the mgr daemons. Go to
the host with a mgr and edit the /var/lib/ceph///unit.run so that the image
specified in the long podman/docker run command in there is the 17.2.7 image.
Then just restart its systemd unit (don't tell the orchestrator to do the
restart of the mgr, as that can cause your change to the unit.run file to be
overwritten). If you only have two mgr daemons, you should be able to use
failovers to make that one the active mgr, at which point the active mgr will
have the patch that fixes this issue and you should be able to get the upgrade
going. `ceph orch daemon redeploy  --image <17.2.7 image>` might also work, but
I tend to find the manual steps are more reliable for this sort of issue, as
you don't have to worry about issues within the orchestrator causing that
operation to fail.

On Tue, Aug 6, 2024 at 7:26 PM Magnus Larsen 
wrote:

> Hi Ceph-users!
>
> Ceph version: ceph version 17.2.6
> (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
> Using cephadm to orchestrate the Ceph cluster
>
> I’m running into https://tracker.ceph.com/issues/59189, which is fixed in
> next version—quincy 17.2.7—via
> https://github.com/ceph/ceph/pull/50906
>
> But I am unable to upgrade to the fixed version because of that bug
>
> When I try to upgrade (using “ceph orch upgrade start –image
> internal_mirror/ceph:v17.2.7”), we see the same error message:
> executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028',
> 'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033', 'dkcphhpcosd034',
> 'dkcphhpcosd035', 'dkcphhpcosd036', 'dkcphhpcosd037', 'dkcphhpcosd038',
> 'dkcphhpcosd039', 'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042',
> 'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed. Traceback (most recent call
> last): File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in
> _write_remote_file conn = await self._remote_connection(host, addr) File
> "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp await
> source.run(srcpath) File "/lib/python3.6/site-packages/asyncssh/scp.py",
> line 458, in run self.handle_error(exc) File
> "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
> raise exc from None File "/lib/python3.6/site-packages/asyncssh/scp.py",
> line 456, in run await self._send_files(path, b'') File
> "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
> self.handle_error(exc) File "/lib/python3.6/site-packages/asyncssh/scp.py",
> line 307, in handle_error raise exc from None File
> "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
> await self._send_file(srcpath, dstpath, attrs) File
> "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
> await self._make_cd_request(b'C', attrs, size, srcpath) File
> "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in
> _make_cd_request self._fs.basename(path)) File
> "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
> raise exc asyncssh.sftp.SFTPFailure: scp: /tmp/etc/ceph/ceph.conf.new:
> Permission denied During handling of the above exception, another exception
> occurred: Traceback (most recent call last): File
> "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work return f(*arg)
> File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
> self._write_client_files(client_files, host) File
> "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in _write_client_files
> self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid) File
> "/usr/share/ceph/mgr/cephadm/ssh.py", line 261, in write_remote_file
> self.mgr.wait_async(self._write_remote_file( File
> "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async return
> self.event_loop.get_result(coro) File "/usr/share/ceph/mgr/cephadm/ssh.py",
> line 56, in get_result return asyncio.run_coroutine_threadsafe(coro,
> self._loop).result() File "/lib64/python3.6/concurrent/futures/_base.py",
> line 432, in result return self.__get_result() File
> "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
> raise self._exception File "/usr/share/ceph/mgr/cephadm/ssh.py", line 249,
> in _write_remote_file logger.exception(msg)
> orchestrator._interface.OrchestratorError: Unable to write
> dkcphhpcmgt028:/etc/ceph/ceph.conf: scp: /tmp/etc/ceph/ceph.conf.new:
> Permission denied
>
> We were thinking about removing the keyring from the Ceph orchestrator (
> https://docs.ceph.com/en/latest/cephadm/operations/#putting-a-keyring-under-management
> ),
> which would then make Ceph not try to copy over a new ceph.conf,
> alleviating the problem (
> https://docs.ceph.com/en/latest/cephadm/operations/#client-keyrings-and-configs
> ),
> but in doing so, Ceph will kindly remove the key from all nodes (
> https://docs.ceph.com/en/latest/cephadm/operations/#disabling-management-of-a-keyring-file
> )
> leaving us without the admin keyring. So that doesn’t sound like a path we
> want to take :S

[ceph-users] mds damaged with preallocated inodes that are inconsistent with inotable

2024-08-07 Thread zxcs
Hi Experts,

we are running a CephFS cluster on v16.2.* with multiple active MDS. Currently, we 
are hitting an "fs cephfs mds.* is damaged" error, and this MDS always complains:


“client  *** loaded with preallocated inodes that are inconsistent with 
inotable”


and the MDS always suicides during replay. Could anyone please help here? We 
really need you to shed some light!


Thanks lot !


xz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Any way to put the rate limit on rbd flatten operation?

2024-08-07 Thread Henry lol
Hello,

AFAIK, massive rx/tx occurs on the client side for the flatten operation.
So I want to control the network rate limit or predict the network
bandwidth it will consume.
Is there any way to do that?
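
Not a real QoS knob as far as I know, but flatten is driven entirely by the
client and its parallelism is governed by rbd_concurrent_management_ops
(default 10), so lowering that throttles the copy indirectly. A rough sketch,
assuming your rbd CLI exposes the option as a per-command override (pool and
image names are placeholders); otherwise set rbd_concurrent_management_ops in
the client's ceph.conf:

# fewer in-flight management ops -> slower, gentler flatten
rbd flatten mypool/myclone --rbd-concurrent-management-ops 1
# the provisioned size shown here is an upper bound on the data to be copied
rbd du mypool/myclone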
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can you return orphaned objects to a bucket?

2024-08-07 Thread Frédéric Nass
Hi,

You're right. The object reindex subcommand backport was rejected for Pacific and is 
still pending for Quincy and Reef. [1]

Use the rgw-restore-bucket-index script instead.

Regards,
Frédéric.

[1] https://tracker.ceph.com/issues/61405


From: vuphun...@gmail.com
Sent: Wednesday, 7 August 2024 01:38
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Can you return orphaned objects to a bucket?

Hi, 
Currently I see it only supports the latest version. Is there any way to 
support older versions like Pacific or Quincy? 
___ 
ceph-users mailing list -- ceph-users@ceph.io 
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW sync gets stuck every day

2024-08-07 Thread Eugen Block

Hi,

Redeploying stuff seems like a much too big hammer to get things  
going again. Surely there must be something more reasonable?


Wouldn't a restart suffice?
Do you see anything in the 'radosgw-admin sync error list'? Maybe an  
error prevents the sync from continuing?
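
Something less invasive than a redeploy, as a sketch (service and realm names
taken from the output below, <bucket> is a placeholder):

ceph orch restart rgw.zone1-backup
radosgw-admin sync error list --rgw-realm backup
radosgw-admin bucket sync status --bucket=<bucket>   # for a bucket you suspect is stuck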



Zitat von Olaf Seibert :


Hi all,

we have some Ceph clusters with RGW replication between them. It  
seems that in the last month at least, it gets stuck at around the  
same time ~every day. Not 100% the same time, and also not 100% of  
the days, but in the more recent days seem to happen more, and for  
longer.


With "stuck" I mean that the "oldest incremental change not applied"  
is getting 5 or more minutes old, and not changing. In the past this  
seemed to resolve itself in a short time, but recently it didn't. It  
remained stuck at the same place for several hours. Also, on several  
different occasions I noticed that the shard number in question was  
the same.


We are using Ceph 18.2.2, image id 719d4c40e096.

The output on one end looks like this (I redacted out some of the  
data because I don't know how much of the naming would be sensitive  
information):


root@zone2:/# radosgw-admin sync status --rgw-realm backup
  realm ----8ddf4576ebab (backup)
  zonegroup ----58af9051e063 (backup)
   zone ----e1223ae425a4 (zone2-backup)
   current time 2024-08-04T10:22:00Z
zonegroup features enabled: resharding
   disabled: compress-encrypted
  metadata sync no sync (zone is master)
  data sync source: ----e8db1c51b705 (zone1-backup)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 3 shards
behind shards: [30,90,95]
oldest incremental change not applied:  
2024-08-04T10:05:54.015403+ [30]


while on the other side it looks ok (not more than half a minute behind):

root@zone1:/# radosgw-admin sync status --rgw-realm backup
  realm ----8ddf4576ebab (backup)
  zonegroup ----58af9051e063 (backup)
   zone ----e8db1c51b705 (zone1-backup)
   current time 2024-08-04T10:23:05Z
zonegroup features enabled: resharding
   disabled: compress-encrypted
  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: ----e1223ae425a4 (zone2-backup)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 4 shards
behind shards: [89,92,95,98]
oldest incremental change not applied:  
2024-08-04T10:22:53.175975+ [95]



With some experimenting, we found that redeploying the RGWs on this  
side resolves the situation: "ceph orch redeploy rgw.zone1-backup".  
The shards go into "Recovering" state and after a short time it is  
"caught up with source" as well.


Redeploying stuff seems like a much too big hammer to get things  
going again. Surely there must be something more reasonable?


Also, any ideas about how we can find out what is causing this? It  
may be that some customer has some job running every 24 hours, but  
that shouldn't cause the replication to get stuck.


Thanks in advance,

--
Olaf Seibert
Site Reliability Engineer

SysEleven GmbH
Boxhagener Straße 80
10245 Berlin

T +49 30 233 2012 0
F +49 30 616 7555 0

https://www.syseleven.de
https://www.linkedin.com/company/syseleven-gmbh/

Current system status always at:
https://www.syseleven-status.net/

Company headquarters: Berlin
Registered court: AG Berlin Charlottenburg, HRB 108571 Berlin
Managing directors: Andreas Hermann, Jens Ihlenfeld, Norbert Müller,  
Jens Plogsties

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: Please guide us in identifying the cause of the data miss in EC pool

2024-08-07 Thread Frédéric Nass
Hi Chulin,

Are you 100% sure that 494, 1169 and 1057 (which did not restart) were in the 
acting set at the exact moment the power outage occurred? 

I'm asking because min_size 6 would have allowed the data to be written to as 
few as 6 OSDs, all of which may then have crashed.

Bests,
Frédéric.



From: Best Regards 
Sent: Thursday, 8 August 2024 08:10
To: Frédéric Nass
Cc: ceph-users 
Subject: Re:Re: Re:Re: Re:Re: Re:Re: [ceph-users] Please guide us in identifying 
the cause of the data miss in EC pool

Hi, Frédéric Nass


Thank you for your continued attention and guidance. Let's analyze and verify 
this issue from different perspectives.


The reason we did not stop the investigation is that we are trying to find other 
ways to avoid the losses caused by this kind of sudden failure. Turning off the disk 
cache is the last option; of course, that operation will only be carried out 
after we find definite evidence.

I also have a question: among the 9 OSDs, some have never been restarted. In 
theory, those OSDs should still retain the object info (metadata, pg_log, etc.), even if 
the object itself cannot be recovered. I went through the boot logs of the OSDs where the 
object should be located and the PG peering process:


OSDs 494/1169/1057 have been in the running state the whole time, and osd.494 was the primary of 
the acting_set during the failure. However, no record of the object was found on them 
using `ceph-objectstore-tool --op list` or `--op log`, so the loss of data due to 
disk cache loss does not seem to be a complete explanation (perhaps there is 
some processing logic that we have not paid attention to).
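
For clarity, such a check would roughly look like this (PG id and object name
are placeholders; the OSD must be stopped first, and on a cephadm cluster the
tool would be run inside `cephadm shell --name osd.494`):

systemctl stop ceph-osd@494        # or: ceph orch daemon stop osd.494
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-494 \
    --pgid <pgid> --op list | grep <object-name>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-494 \
    --pgid <pgid> --op log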





Best Regards,



Woo
wu_chu...@qq.com







   
Original Email
   
 

From:"Frédéric Nass"< frederic.n...@univ-lorraine.fr >;

Sent Time:2024/8/8 4:01

To:"wu_chulin"< wu_chu...@qq.com >;

Subject: Re:Re: Re:Re: Re:Re: [ceph-users] Please guide us in identifying 
the cause of the data miss in EC pool


Hey Chulin,


Looks clearer now.
 


Non-persistent cache for KV metadata and Bluestore metadata certainly explains 
how data was lost without the cluster even noticing.


What's unexpected is the data staying for so long in the disks' buffers and not 
being written to persistent sectors at all.


Anyways, thank you for sharing your use case and investigation. It was nice 
chatting with you.


If you can, share this in the ceph-user list. It will for sure benefit everyone 
in the community.


Best regards,
Frédéric.


PS: Note that using min_size >= k + 1 on EC pools is recommended (just as 
min_size >= 2 is on replicated size-3 pools), because you don't want to write data without 
any parity chunks.









From: wu_chu...@qq.com
Sent: Wednesday, 7 August 2024 11:30
To: Frédéric Nass
Subject: Re:Re: Re:Re: Re:Re: [ceph-users] Please guide us in identifying 
the cause of the data miss in EC pool




Hi,
Yes, after the file -> object -> PG -> OSD correspondence is found, 
the object record can be found on the specified OSD using the command 
`ceph-objectstore-tool --op list`.

The pool min_size is 6


The business department reported more than 30 lost files, but we proactively identified more 
than 100. The upload times of the lost files were mainly distributed over the roughly 3 
hours before the failure, and these files had been successfully downloaded after 
being uploaded (per the RGW logs).


One OSD corresponds to one disk, and no separate space is allocated for WAL/DB.


The HDD cache is at its default (enabled by default on SATA drives), and we have not 
forcibly turned the disk cache off because of the performance impact.


The loss of OSD data due to the loss of the hard disk cache was our initial 
inference, and the initial explanation provided to the business department was 
the same. When the cluster was restored, Ceph reported 12 unfound objects, which is 
acceptable; after all, most devices were powered off abnormally, and 
it is difficult to ensure the integrity of all data. Up to now, our team has 
not located how the data was lost. In the past, when hard disk hardware was 
damaged, either the OSD could not start because key data was damaged, or some 
objects were read incorrectly after the OSD started, which could be repaired. 
Now deep-scrub cannot find the problem, which may be related to the loss (or 
deletion) of the object metadata. After all, deep-scrub needs the object list of 
the current PG; if those 9 OSDs do not have the object's metadata, 
deep-scrub does not know the object exists.



wu_chu...@qq.com








Original Email



From:"Frédéric Nass"< frederic.n...@univ-lorraine.fr >;

Sent Time:2024/8/6 20:40

To:"wu_chulin"< wu_chu...@qq.com >;

Subject: Re:Re: Re:Re: [ceph-users] Please guide us in identifying the cause 
of the data miss in EC pool




That's interesting.


Have you tried to correlate any existing retrievable object to its PG id and OSD 
mapping in order to verify the presence of each of these object's shards using 
ceph-objectstore-tool on each one of its acting OSDs, for a

[ceph-users] Re: mds damaged with preallocated inodes that are inconsistent with inotable

2024-08-07 Thread Venky Shankar
On Thu, Aug 8, 2024 at 12:41 AM zxcs  wrote:
>
> HI, Experts,
>
> we are running a cephfs with V16.2.*, and has multi active mds. Currently, we 
> are hitting  a mds fs cephfs mds.*  id damaged. and this mds always complain
>
>
> “client  *** loaded with preallocated inodes that are inconsistent with 
> inotable”
>
>
> and the mds always suicide during replay. Could anyone please help here ? We 
> really need you shed some light!

Could you share (debug) mds logs when it hits this during replay?
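
A rough way to capture that, assuming the damaged rank is rank 0 of the fs
"cephfs" (adjust the names; where the log ends up depends on whether file
logging is enabled):

ceph config set mds debug_mds 20
ceph config set mds debug_journaler 20
ceph mds repaired cephfs:0     # clear the damaged flag so the rank retries replay
# collect the MDS log from the replaying daemon, then drop the overrides again:
ceph config rm mds debug_mds
ceph config rm mds debug_journaler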

>
>
> Thanks lot !
>
>
> xz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Venky
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io