[ceph-users] Re: Issues upgrading cephadm cluster from Octopus.

2022-11-19 Thread Adam King
I don't know for sure if it will fix the issue, but the migrations happen
based on a config option, "mgr/cephadm/migration_current". You could try
setting that back to 0; that would at least trigger the migrations to run
again after restarting/failing over the mgr. They're meant to be idempotent,
so in the worst case it just won't accomplish anything. Also, you're correct
that this isn't in the docs. The migrations were intended to be internal and
to never require user action, but it appears something has gone wrong in this
case.
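
Roughly, that would be something like this (standard commands, but untested
against your exact situation):

ceph config set mgr mgr/cephadm/migration_current 0
ceph mgr fail   # fail over the active mgr so the migrations get re-run
                # (pass the active mgr's name if your release requires it)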

On Fri, Nov 18, 2022 at 3:06 PM Seth T Graham  wrote:

> We have a cluster running Octopus (15.2.17) that I need to get updated, and
> I am getting cephadm failures when upgrading the managers; I have tried both
> Pacific and Quincy with the same results. The cluster was deployed with
> cephadm on CentOS Stream 8 using podman, and due to network isolation of the
> cluster the images are being pulled from a private registry. When I issue
> the 'ceph orch upgrade' command it starts out well, updating two of the
> three managers. When it gets to the point of transitioning to one of the
> upgraded managers the process stops with an error, with 'ceph status'
> reporting that the cephadm module has failed.
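>
> (The upgrade is started with something along the lines of 'ceph orch upgrade
> start --image <private-registry>/ceph/ceph:<version>', pointing at our
> internal registry; the exact registry host and tag are omitted here.)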
>
> Digging through the logs, I find a python stack trace that reads:
>
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 587, in serve
> serve.serve()
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 67, in serve
> self.convert_tags_to_repo_digest()
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 974, in
> convert_tags_to_repo_digest
> self._get_container_image_info(container_image_ref))
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 590, in wait_async
> return self.event_loop.get_result(coro)
>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
> return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
>   File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
> return self.__get_result()
>   File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in
> __get_result
> raise self._exception
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1374, in
> _get_container_image_info
> await self._registry_login(host,
> json.loads(str(self.mgr.get_store('registry_credentials'
>   File "/lib64/python3.6/json/__init__.py", line 354, in loads
> return _default_decoder.decode(s)
>   File "/lib64/python3.6/json/decoder.py", line 339, in decode
> obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>   File "/lib64/python3.6/json/decoder.py", line 357, in raw_decode
> raise JSONDecodeError("Expecting value", s, err.value) from None
> json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
>
>
> Looking through the ceph config, there is indeed no setting for the
> 'registry_credentials' value; instead I have the registry_password,
> registry_url and registry_username values that were set when the cluster
> was provisioned.
>
> I do find mention of this key in the migrations.py script (which lives in
> /usr/share/ceph/mgr/cephadm), under the function 'migrate_4_5', which reads
> to me as though the old keys have been retired in favor of a unified key
> containing a JSON object. So I attempted to recreate what that function does
> by setting that key manually, but unfortunately this didn't help.
>
> (e.g. ceph config set mgr mgr/cephadm/registry_credentials '{ "url":
> "XXX", "username": "XXX", "password": "XXX" }')
>
> I'm not sure where to go from here. Is there a 'migrate' option I can
> specify somewhere to properly upgrade this cluster, and perhaps run the
> code found in migrations.py? I don't see any mention of this in the
> documentation, but there's a lot of documentation so it's possible I missed
> it.
>
> Failing that, are there any suggestions for a workaround so I can get this
> upgrade completed?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issues upgrading cephadm cluster from Octopus.

2022-11-19 Thread Adam King
I will also add, since it could help resolve this, that there is no
"mgr/cephadm/registry_json" config option. The whole reason for moving from
the previous 3 options to the new JSON object was actually to move it from
config options, which can get spit out in logs, to the config-key store,
where it's a bit more secure given we're working with a password. If you
want to see what it's currently set to, you would do a "ceph config-key get
mgr/cephadm/registry_credentials"; setting it is roughly the same, but swap
get for set and provide the JSON string as you had been trying to do before.
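
Concretely, that would be something along these lines (XXX are placeholders
for your real values):

ceph config-key get mgr/cephadm/registry_credentials
ceph config-key set mgr/cephadm/registry_credentials \
  '{"url": "XXX", "username": "XXX", "password": "XXX"}'

You will likely need to fail over or restart the mgr again afterwards, since
the module is currently in a failed state.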

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] backfilling kills rbd performance

2022-11-19 Thread Konold, Martin


Hi,

on a 3-node hyper-converged PVE cluster with 12 SSD OSD devices I am
experiencing stalls in RBD performance during normal backfill operations,
e.g. when moving a pool from 2/1 to 3/2 replication.
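
For clarity, by "moving a pool from 2/1 to 3/2" I mean something along the
lines of (pool name is a placeholder):

ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2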


I was expecting to be able to control the load caused by the backfilling
using

ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
or
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 1'

but even

ceph tell 'osd.*' config set osd_recovery_sleep_ssd 2.1

did not help.
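
(To rule out the settings simply not being applied, I believe the effective
values can be checked on a running OSD with something like

ceph tell osd.0 config get osd_max_backfills
ceph config show osd.0 osd_recovery_sleep_ssd

though I may well be missing something here.)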

Any hints?

Normal operation looks like:
2022-11-19T16:16:52.142355+ mgr.pve-02 (mgr.18134134) 60414 : 
cluster [DBG] pgmap v59642: 576 pgs: 576 active+clean; 2.4 TiB data, 4.7 
TiB used, 12 TiB / 16 TiB avail; 3.3 KiB/s rd, 2.7 MiB/s wr, 63 op/s
2022-11-19T16:16:54.144082+ mgr.pve-02 (mgr.18134134) 60416 : 
cluster [DBG] pgmap v59643: 576 pgs: 576 active+clean; 2.4 TiB data, 4.7 
TiB used, 12 TiB / 16 TiB avail; 2.7 KiB/s rd, 1.3 MiB/s wr, 56 op/s


I am running Ceph Quincy 17.2.5 on a test system with a dedicated
1 Gbit/9000 MTU storage network, while the public Ceph network
(1 Gbit/1500 MTU) is shared with the VM network.


I am looking forward to your suggestions.

Regards,
ppa. Martin Konold

--
Martin Konold - Prokurist, CTO
KONSEC GmbH - make things real
Amtsgericht Stuttgart, HRB 23690
Geschäftsführer: Andreas Mack
Im Köller 3, 70794 Filderstadt, Germany
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: backfilling kills rbd performance

2022-11-19 Thread Konold, Martin

Hi,

On 2022-11-19 17:32, Anthony D'Atri wrote:

> I’m not positive that the options work with hyphens in them.  Try
>
> ceph tell osd.* injectargs '--osd_max_backfills 1
> --osd_recovery_max_active 1 --osd_recovery_max_single_start 1
> --osd_recovery_op_priority=1'


Did so.


> With Quincy the following should already be set, but to be sure:
>
> ceph tell osd.* config set osd_op_queue_cut_off high


Did so too, and even restarted all OSDs as recommended.
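
(If it helps, I believe the effective value can be confirmed with something
like 'ceph config show osd.0 osd_op_queue_cut_off'.)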

I then stopped a single OSD in order to cause some backfilling.


> What is network saturation like on that 1GE replication network?


Typically 100% saturated.


> Operations like yours that cause massive data movement could easily
> saturate a pipe that narrow.


Sure, but I am used to other setups where the recovery can be slowed
down in order to keep the RBDs operating.


To me it looks like all the backfilling happens in parallel, without any
pauses in between that would benefit the client traffic.


I would expect some of those PGs to be in the
active+undersized+degraded+remapped+backfill_wait state instead of
backfilling.


2022-11-19T16:58:50.139390+ mgr.pve-02 (mgr.18134134) 61735 : 
cluster [DBG] pgmap v60978: 576 pgs: 102 
active+undersized+degraded+remapped+backfilling, 474 active+clean; 2.4 
TiB data, 4.3 TiB used, 10 TiB / 15 TiB avail; 150 KiB/s wr, 10 op/s; 
123337/1272524 objects degraded (9.692%); 228 MiB/s, 58 objects/s 
recovering


Is this Quincy specific?

Regards
--martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io