[ceph-users] Question regarding Quincy mclock scheduler.

2022-11-09 Thread philippe

Hi,
We have a Quincy 17.2.5 based cluster, and we have some questions
regarding the mclock IOPS scheduler.
Looking into the documentation, the default profile is HIGH_CLIENT_OPS,
which means that 50% of the IOPS of an OSD are reserved for client operations.

But looking at the OSD configuration settings, it seems that this is not
the case, or probably there is something I don't understand.


ceph config get osd.0 osd_mclock_profile
high_client_ops



ceph config show osd.0 | grep mclock
osd_mclock_max_capacity_iops_hdd                  22889.222997   mon
osd_mclock_scheduler_background_best_effort_lim   99             default
osd_mclock_scheduler_background_best_effort_res   1144           default
osd_mclock_scheduler_background_best_effort_wgt   2              default
osd_mclock_scheduler_background_recovery_lim      4578           default
osd_mclock_scheduler_background_recovery_res      1144           default
osd_mclock_scheduler_background_recovery_wgt      1              default
osd_mclock_scheduler_client_lim                   99             default
osd_mclock_scheduler_client_res                   2289           default
osd_mclock_scheduler_client_wgt                   2              default

So I have osd_mclock_max_capacity_iops_hdd = 22889.222997; why is
osd_mclock_scheduler_client_res not 11444?

This value seems strange to me.

Kr
Philippe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to ... alertmanager and prometheus

2022-11-09 Thread Eugen Block
The only thing I noticed was that I had to change the grafana-api-url
for the dashboard when I stopped one of the two grafana instances. I
wasn't able to test the dashboard before because I had to wait for new
certificates so my browser wouldn't complain about the cephadm cert.
So it seems as if the failover doesn't work entirely automatically, but
it's not too much work to switch the api url. :-)
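(For reference, pointing the dashboard at the surviving instance is a
one-liner; the host name below is just a placeholder:

ceph dashboard set-grafana-api-url https://<surviving-grafana-host>:3000

After that the Grafana panels in the dashboard load again.)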


Zitat von Michael Lipp :


Thank you both very much! I have understood things better now.

I'm not sure, though, whether all URIs are adjusted properly when  
changing the placement of the services. Still testing...


Am 08.11.22 um 17:13 schrieb Redouane Kachach Elhichou:

Welcome Eugen,

There are some ongoing efforts to make the whole prometheus stack config
more dynamic by using the http sd configuration [1]. In fact, part of the
changes is already in main, but it will not be available until the next
official Ceph release.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config



On Tue, Nov 8, 2022 at 4:47 PM Eugen Block  wrote:


I somehow missed the HA part in [1], thanks for pointing that out.


Zitat von Redouane Kachach Elhichou :


If you are running quincy and using cephadm then you can have more
instances of prometheus (and other monitoring daemons) running in HA mode
by increasing the number of daemons as in [1]:

from a cephadm shell (to run 2 instances of prometheus and

alertmanager):

ceph orch apply prometheus --placement 'count:2'
ceph orch apply alertmanager --placement 'count:2'

You can have as many instances as you need. You can choose on which nodes
to place them by using the daemon placement specification of cephadm [2],
e.g. by using a specific label for monitoring (see the sketch after the
references below). In case of mgr failover, cephadm should reconfigure
the daemons accordingly.

[1]


https://docs.ceph.com/en/quincy/cephadm/services/monitoring/#deploying-monitoring-with-cephadm

[2] https://docs.ceph.com/en/quincy/cephadm/services/#daemon-placement
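As a sketch of the label-based variant (host names and the "monitoring"
label are placeholders, not something cephadm requires):

ceph orch host label add host1 monitoring
ceph orch host label add host2 monitoring
ceph orch apply prometheus --placement 'label:monitoring'
ceph orch apply alertmanager --placement 'label:monitoring'

cephadm then keeps one daemon of each type on every host carrying the
label, so adding or removing the label moves the monitoring stack around.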

Hope it helps,
Redouane.




On Tue, Nov 8, 2022 at 3:58 PM Eugen Block  wrote:


Hi,

the only information I found so far was this statement from the redhat
docs [1]:


When multiple services of the same type are deployed, a
highly-available setup is deployed.

I tried to do that in a virtual test environment (16.2.7) and it seems
to work as expected.

ses7-host1:~ # ceph orch ps --daemon_type prometheus
NAME                   HOST        PORTS   STATUS           REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
prometheus.ses7-host1  ses7-host1          running (6h)     12s ago    12M  165M     -        2.18.0   8eb9f2694232  04a0b33e2474
prometheus.ses7-host2  ses7-host2  *:9095  host is offline  89s ago    6h   236M     -                 8eb9f2694232  0cb070cea4eb

host2 was the active mgr before I shut it down, but I still have
access to prometheus metrics as well as active alerts from
alertmanager, there's also one spare instance running, the same
applies for grafana:

ses7-host1:~ # ceph orch ps --daemon_type alertmanager
NAME                     HOST        PORTS        STATUS          REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
alertmanager.ses7-host1  ses7-host1               running (6h)    42s ago    12M  33.7M    -        0.16.2   903e9b49157e  5a4ffc9a79da
alertmanager.ses7-host2  ses7-host2  *:9093,9094  running (102s)  44s ago    6h   35.5M    -                 903e9b49157e  71ac3c636a6b

ses7-host1:~ # ceph orch ps --daemon_type prometheus
NAME                   HOST        PORTS   STATUS          REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
prometheus.ses7-host1  ses7-host1          running (6h)    44s ago    12M  156M     -        2.18.0   8eb9f2694232  04a0b33e2474
prometheus.ses7-host2  ses7-host2  *:9095  running (104s)  47s ago    6h   250M     -                 8eb9f2694232  87a5a8349f05

ses7-host1:~ # ceph orch ps --daemon_type grafana
NAME                HOST        PORTS   STATUS          REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
grafana.ses7-host1  ses7-host1          running (6h)    47s ago    12M  99.6M    -        7.1.5    31b52dc794e2  7935ecf47b38
grafana.ses7-host2  ses7-host2  *:3000  running (107s)  49s ago    6h   108M     -        7.1.5    31b52dc794e2  17dea034bb33

I just specified two hosts in the placement section of each service
and deployed them. I think this should be mentioned in the ceph docs
(not only redhat).

[1]



https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html/operations_guide/management-of-monitoring-stack-using-the-ceph-orchestrator

Zitat von Michael Lipp :


Hi,

I've just setup a test cluster with cephadm using quincy. Things
work nicely. However, I'm not sure how to "handle" alertmanager and
prometheus.

Both services obviously aren't crucial to the working of the
storage, fine. But th

[ceph-users] Re: How to ... alertmanager and prometheus

2022-11-09 Thread Sake Paulusma
Hi

I noticed that cephadm would update the grafana-frontend-api-url with version
17.2.3, but it looks broken with version 17.2.5. It isn't a big deal to update
the url myself, but it's quite irritating to do given that it used to correct
itself in the past.

Best regards,
Sake

From: Eugen Block 
Sent: Wednesday, November 9, 2022 9:26:28 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: How to ... alertmanager and prometheus

The only thing I noticed was that I had to change the grafana-api-url
for the dashboard when I stopped one of the two grafana instances. I
wasn't able to test the dashboard before because I had to wait for new
certificates so my browser wouldn't complain about the cephadm cert.
So it seems as if the failover doesn't work entirely automatic, but
it's not too much work to switch the api url. :-)

Zitat von Michael Lipp :

> Thank you both very much! I have understood things better now.
>
> I'm not sure, though, whether all URIs are adjusted properly when
> changing the placement of the services. Still testing...
>
> Am 08.11.22 um 17:13 schrieb Redouane Kachach Elhichou:
>> Welcome Eugen,
>>
>> There are some ongoing efforts to make the whole prometheus stack config
>> more dynamic by using the http sd configuration [1]. In fact part of the
>> changes are already in main but they will not be available till the next
>> Ceph official release.
>>
>> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config
>>
>>
>> On Tue, Nov 8, 2022 at 4:47 PM Eugen Block  wrote:
>>
>>> I somehow missed the HA part in [1], thanks for pointing that out.
>>>
>>>
>>> Zitat von Redouane Kachach Elhichou :
>>>
 If you are running quincy and using cephadm then you can have more
 instances of prometheus (and other monitoring daemons) running in HA mode
 by increasing the number of daemons as in [1]:

 from a cephadm shell (to run 2 instances of prometheus and
>>> altertmanager):
> ceph orch apply prometheus --placement 'count:2'
> ceph orch apply alertmanager --placement 'count:2'
 You can have as many instances as you need. You can choose on which nodes
 to place them by using the daemon placement specification of cephadm [2]
>>> by
 using a specific label for monitoring i.e. In case of mgr failover
>>> cephadm
 should reconfigure the daemons accordingly.

 [1]

>>> https://docs.ceph.com/en/quincy/cephadm/services/monitoring/#deploying-monitoring-with-cephadm
 [2] 
 https://docs.ceph.com/en/quincy/cephadm/services/#daemon-placement

 Hope it helps,
 Redouane.




 On Tue, Nov 8, 2022 at 3:58 PM Eugen Block  wrote:

> Hi,
>
> the only information I found so far was this statement from the redhat
> docs [1]:
>
>> When multiple services of the same type are deployed, a
>> highly-available setup is deployed.
> I tried to do that in a virtual test environment (16.2.7) and it seems
> to work as expected.
>
> ses7-host1:~ # ceph orch ps --daemon_type prometheus
> NAME   HOSTPORTS   STATUS   REFRESHED
> AGE  MEM USE  MEM LIM  VERSION  IMAGE ID  CONTAINER ID
> prometheus.ses7-host1  ses7-host1  running (6h)   12s ago
> 12M 165M-  2.18.0   8eb9f2694232  04a0b33e2474
> prometheus.ses7-host2  ses7-host2  *:9095  host is offline8

[ceph-users] Re: Recent ceph.io Performance Blog Posts

2022-11-09 Thread Stefan Kooman

On 11/8/22 21:20, Mark Nelson wrote:

Hi Folks,

I thought I would mention that I've released a couple of performance 
articles on the Ceph blog recently that might be of interest to people:


For sure, thanks a lot, it's really informative!

Can we also ask for special requests? One of the things that would help
us (and CephFS users in general) is how the performance of CephFS for small
files (~512 bytes, 2k up to say 64K) is impacted by the number of PGs a
CephFS metadata pool has.


Question that might be answered:

- does it help to provision more PGs for workloads that rely heavily on 
OMAP usage by the MDS (or is RocksDB the bottleneck in all cases)?


Tests that might be useful:

- rsync (single threaded, worst case)
- fio random read / write tests with varying io depths and threads (roughly
  along the lines of the sketch after this list)
- The CephFS devs might know some performance tests in this context
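For the fio part, something along these lines is what I have in mind (the
mount point, file counts and sizes are placeholders, to be scaled to the
cluster under test):

fio --name=smallfiles --directory=/mnt/cephfs/bench --ioengine=libaio \
    --direct=1 --rw=randwrite --bs=4k --filesize=64k --nrfiles=1000 \
    --numjobs=8 --iodepth=16 --group_reporting

Repeating the same run with the metadata pool set to different pg_num values
would show whether the PG count matters at all for this workload.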

One of the tricky things with doing these benchmarks is that the PG
placement over the OSDs might heavily impact performance all by itself,
as primary PGs are not placed in the same way with a different number of
PGs in the pool. Therefore, ideally, the primaries are balanced as
evenly as possible. I'm eagerly awaiting the Ceph virtual 2022 talk "New
workload balancer in Ceph". Having the primaries balanced before these
benchmarks run seems to be a prerequisite for an "apples to apples"
comparison.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Large strange flip in storage accounting

2022-11-09 Thread Frank Schilder
Hi all,

during maintenance yesterday we observed something extremely strange on our
production cluster. We needed to rebalance storage from slow to fast SSDs in
small pools. The pools affected by this operation were con-rbd-meta-hpc-one,
con-fs2-meta1 and con-fs2-meta2 (see ceph df output below). We changed the
device class in the crush rule to move the data, and something very strange
happened (first instance). In addition to the pools that we were moving to
different disks, the pool sr-rbd-data-one-perf, in a completely different
sub-tree on a completely different device class, also showed 3 remapped PGs.
I don't know how this is even possible, but, well.

While editing the crush rule we had norebalance set and let peering finish 
before data movement. We also wanted to check the new mappings before letting 
data move. After un-setting norebalance the 3 PGs on sr-rbd-data-one-perf 
became clean in what seemed to be an instant. In addition to that, something 
very strange happened again (second instance). The output of ceph df changed 
immediately and completely.

Before starting the data movement it would look like this:

--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd       11 PiB   8.2 PiB  3.0 PiB  3.0 PiB   27.16
rbd_data  262 TiB  154 TiB  105 TiB  108 TiB   41.23
rbd_perf  31 TiB   19 TiB   12 TiB   12 TiB    38.93
ssd       8.4 TiB  7.1 TiB  15 GiB   1.3 TiB   15.09
TOTAL     12 PiB   8.3 PiB  3.1 PiB  3.2 PiB   27.50

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
sr-rbd-meta-one         1   128  7.9 GiB   12.94k  7.9 GiB   0.01     25 TiB
sr-rbd-data-one         2  2048   75 TiB   26.83M   75 TiB  43.67     72 TiB
sr-rbd-one-stretch      3   160  222 GiB   68.81k  222 GiB   0.29     25 TiB
con-rbd-meta-hpc-one    7    50   54 KiB       61   54 KiB      0     12 TiB
con-rbd-data-hpc-one    8   150   36 GiB    9.42k   36 GiB      0    5.0 PiB
sr-rbd-data-one-hdd    11   560  126 TiB   33.52M  126 TiB  36.82    162 TiB
con-fs2-meta1          12   256  241 GiB   40.74M  241 GiB   0.65    9.0 TiB
con-fs2-meta2          13  1024      0 B  362.90M      0 B      0    9.0 TiB
con-fs2-data           14  1350  1.1 PiB  407.17M  1.1 PiB  14.42    5.0 PiB
con-fs2-data-ec-ssd    17   128  386 GiB    6.57M  386 GiB   1.03     29 TiB
ms-rbd-one             18   256  416 GiB  166.81k  416 GiB   0.54     25 TiB
con-fs2-data2          19  8192  1.3 PiB  534.20M  1.3 PiB  16.80    4.6 PiB
sr-rbd-data-one-perf   20  4096  4.3 TiB    1.13M  4.3 TiB  19.96    5.8 TiB
device_health_metrics  21     1  196 MiB      994  196 MiB      0     25 TiB

Immediately after starting the data movement (well, after the 3 strange PGs 
were clean) it started looking like this:

--- RAW STORAGE ---
CLASS     SIZE     AVAIL    USED     RAW USED  %RAW USED
fs_meta   8.7 TiB  8.7 TiB  3.8 GiB  62 GiB     0.69
hdd       11 PiB   8.2 PiB  3.0 PiB  3.0 PiB   27.19
rbd_data  262 TiB  154 TiB  105 TiB  108 TiB   41.30
rbd_perf  31 TiB   19 TiB   12 TiB   12 TiB    38.94
TOTAL     12 PiB   8.3 PiB  3.2 PiB  3.2 PiB   27.51

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
sr-rbd-meta-one         1   128  8.1 GiB   12.94k   20 GiB   0.03     26 TiB
sr-rbd-data-one         2  2048  103 TiB   26.96M  102 TiB  50.97     74 TiB
sr-rbd-one-stretch      3   160  262 GiB   68.81k  614 GiB   0.78     26 TiB
con-rbd-meta-hpc-one    7    50  4.6 MiB       61   14 MiB      0    2.7 TiB
con-rbd-data-hpc-one    8   150   36 GiB    9.42k   41 GiB      0    5.0 PiB
sr-rbd-data-one-hdd    11   560  131 TiB   33.69M  214 TiB  49.99    161 TiB
con-fs2-meta1          12   256  421 GiB   40.74M  1.6 TiB  16.78    2.0 TiB
con-fs2-meta2          13  1024      0 B  362.89M      0 B      0    2.0 TiB
con-fs2-data           14  1350  1.1 PiB  407.17M  1.2 PiB  16.27    5.0 PiB
con-fs2-data-ec-ssd    17   128  564 GiB    6.57M  588 GiB   1.57     29 TiB
ms-rbd-one             18   256  637 GiB  166.82k  1.2 TiB   1.60     26 TiB
con-fs2-data2          19  8192  1.3 PiB  534.21M  1.6 PiB  20.37    4.6 PiB
sr-rbd-data-one-perf   20  4096  4.3 TiB    1.13M   12 TiB  41.41    5.7 TiB
device_health_metrics  21     1  207 MiB      994  620 MiB      0     26 TiB

The STORED and USED columns show completely different numbers now. In fact, I
believe the new numbers are correct, because they match much better with the
%USED of the fullest OSD in the respective pools, and the USED column reflects
the replication factor correctly.

This cluster was upgraded recently from mimic to octopus. Any idea what could
have triggered this change in accounting and which numbers I should believe?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...

[ceph-users] Re: Recent ceph.io Performance Blog Posts

2022-11-09 Thread Eshcar Hillel
Hi Mark,

Thanks for posting these blogs. They are very interesting to read.
Maybe you have an answer to a question I asked in the dev list:

We run a fio benchmark against a 3-node Ceph cluster with 96 OSDs. Objects are
4 KB. We use the gdbpmp profiler (https://github.com/markhpc/gdbpmp) to analyze
the threads' performance.
We discovered that the bstore_kv_sync thread is always busy, while all 16
tp_osd_tp threads are not busy most of the time (waiting on a condition
variable or a lock).
Given that the 3 RocksDB CFs are sharded, and sharding is configurable, why not
run multiple (3) bstore_kv_sync threads? They won't have conflicts most of the
time.
This has the potential of removing the RocksDB bottleneck and increasing IOPS.

Can you explain this design choice?


From: Mark Nelson 
Sent: Tuesday, November 8, 2022 10:20 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Recent ceph.io Performance Blog Posts

Hi Folks,

I thought I would mention that I've released a couple of performance
articles on the Ceph blog recently that might be of interest to people:

 1.
https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/

 2.
https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/

 3.
https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/


The first covers RocksDB tuning. How we arrived at our defaults, an
analysis of some common settings that have been floating around on the
mailing list, and potential new settings that we are considering making
default in the future.

The second covers how to tune QEMU/KVM with librbd to achieve high
single-client performance on a small (30 OSD) NVMe backed cluster. This
article also covers the performance impact when enabling 128bit AES
over-the-wire encryption.

The third covers per-OSD CPU/Core scaling and the kind of IOPS/core and
IOPS/NVMe numbers that are achievable both on a single OSD and on a
larger (60 OSD) NVMe cluster. In this case there are enough clients and
a high enough per-client iodepth to saturate the OSD(s).

I hope these are helpful or at least interesting!

Thanks,
Mark

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question regarding Quincy mclock scheduler.

2022-11-09 Thread Aishwarya Mathuria
Hello Philippe,

Your understanding is correct, 50% of IOPS are reserved for client
operations.
osd_mclock_max_capacity_iops_hdd defines the capacity per OSD.

There is a mClock queue for each OSD shard. The number of shards is
defined by osd_op_num_shards_hdd, which by default is set to 5.
So each queue has osd_mclock_max_capacity_iops_hdd/osd_op_num_shards_hdd
IOPS.
In your case, this would mean that the capacity of each mClock queue is
equal to 4578 IOPS (22889/5).
This would make osd_mclock_scheduler_client_res = 2289 (50%).
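As a quick sanity check you can redo the arithmetic yourself (osd.0 is just
the example OSD from your mail; the shard count is the default):

ceph config get osd.0 osd_mclock_max_capacity_iops_hdd   # 22889.222997
ceph config get osd.0 osd_op_num_shards_hdd              # 5 (default)
# per-shard capacity       : 22889 / 5 ~= 4578 IOPS
# client reservation (50%) : 4578 / 2  ~= 2289 -> osd_mclock_scheduler_client_res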

I hope that explains the numbers you are seeing.

We also have a few optimizations coming with regard to the mClock profiles,
where we are increasing the reservation for client operations in the
high_client_ops profile. This change will also address the inflated OSD
capacity numbers that are encountered in some cases with osd bench.

Let me know if you have any other questions.

Regards,
Aishwarya

On Wed, Nov 9, 2022 at 1:46 PM philippe  wrote:

> Hi,
> We have a quincy 17.2.5 based cluster, and we have some question
> regarding the mclock iops scheduler.
> Looking into the documentation, the default profile is the HIGH_CLIENT_OPS
> that mean that 50% of IOPS for an OSD are reserved for clients operations.
>
> But looking into OSD configuration settings, it seems that this is not
> the case or probably there is something I don't understand.
>
> ceph config get osd.0 osd_mclock_profile
> high_client_ops
>
>
>
> ceph config show osd.0 | grep mclock
> osd_mclock_max_capacity_iops_hdd 22889.222997   mon
> osd_mclock_scheduler_background_best_effort_lim  99 default
> osd_mclock_scheduler_background_best_effort_res  1144
> default
> osd_mclock_scheduler_background_best_effort_wgt  2 default
> osd_mclock_scheduler_background_recovery_lim 4578
> default
> osd_mclock_scheduler_background_recovery_res 1144
> default
> osd_mclock_scheduler_background_recovery_wgt 1 default
> osd_mclock_scheduler_client_lim  99
> default
> osd_mclock_scheduler_client_res  2289
> default
> osd_mclock_scheduler_client_wgt  2 default
>
> So i have osd_mclock_max_capacity_iops_hdd = 22889.222997 why
> osd_mclock_scheduler_client_res is not 11444 ?
>
> this value seem strange to me.
>
> Kr
> Philippe
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recent ceph.io Performance Blog Posts

2022-11-09 Thread Mark Nelson

On 11/9/22 6:03 AM, Eshcar Hillel wrote:

Hi Mark,

Thanks for posting these blogs. They are very interesting to read.
Maybe you have an answer to a question I asked in the dev list:

We run fio benchmark against a 3-node ceph cluster with 96 OSDs.
Objects are 4kb. We use the gdbpmp profiler
https://github.com/markhpc/gdbpmp to analyze the threads' performance.
we discovered the bstore_kv_sync thread is always busy, while all 16 
tp_osd_tp threads are not busy most of the time (wait on a conditional 
variable or a lock).
Given that 3 rocksdb CFs are sharded, and sharding is configurable, 
why not run multiple (3) bstore_kv_sync threads? they won't have 
conflicts most of the time.
This has the potential of removing the rocksdb bottleneck and 
increasing IOPS.


Can you explain this design choice?



You are absolutely correct that the bstore_kv_sync thread can often be a 
bottleneck during 4K random writes.  Typically it's not so bad that the 
tp_osd_tp threads are mostly blocked though (feel free to send me a copy 
of the trace, I would be interested in seeing it).  Years ago I 
advocated for the same approach you are suggesting here.  The fear at 
the time was that the changes inside bluestore would be too disruptive.  
The column family sharding approach could be (and was) mostly contained 
to the KeyValueDB glue code.  Column family sharding has been a win from 
the standpoint that it helps us avoid really deep LSM hierarchies in 
RocksDB.  We tend to see faster compaction times and are more likely to 
keep full levels on the fast device.  Sadly it doesn't really help with 
improving metadata throughput and may even introduce a small amount of 
overhead during the WAL flush process.  FWIW slow bstore_kv_sync is one
of the reasons that people will sometimes run multiple OSDs on one NVMe
drive (sometimes it's faster, sometimes it's not).



Maybe a year ago I tried to sort of map out the changes that I thought 
would be necessary to shard across KeyValueDBs inside bluestore itself.  
It didn't look impossible, but would require quite a bit of work (and a 
bit of finesse to restructure the data path).  There's a legitimate
question of whether or not it's worth it now to make those kinds of
changes to bluestore or invest in crimson and seastore at this point.
We ended up deciding not to pursue the changes back then.  I think if we 
changed our minds it would most likely go into some kind of experimental 
bluestore v2 project (along with other things like hierarchical storage) 
so we don't screw up the existing code base.






*From:* Mark Nelson 
*Sent:* Tuesday, November 8, 2022 10:20 PM
*To:* ceph-users@ceph.io 
*Subject:* [ceph-users] Recent ceph.io Performance Blog Posts
Hi Folks,

I thought I would mention that I've released a couple of performance
articles on the Ceph blog recently that might be of interest to people:

 1.
https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/
 2.
https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
 3.
https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/

The first covers RocksDB tuning. How we arrived at our defaults, an
analysis of some common settings that have been floating around on the
mailing list, and potential new settings that we are considering making
default in the future.

The second covers how to tune QEMU/KVM with librbd to achieve high
single-client performance on a small (30 OSD) NVMe backed cluster. This
article also covers the performance impact when enabling 128bit AES
over-the-wire encryption.

The third covers per-OSD CPU/Core scaling and the kind of IOPS/core and
IOPS/NVMe numbers that are achievable both on a single OSD and on a
larger (60 OSD) NVMe cluster. In this case there are enough clients and
a high enough per-client iodepth to saturate the OSD(s).

I hope these are helpful or at least interesting!

Thanks,
Mark

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recent ceph.io Performance Blog Posts

2022-11-09 Thread Mark Nelson

On 11/9/22 4:48 AM, Stefan Kooman wrote:


On 11/8/22 21:20, Mark Nelson wrote:

Hi Folks,

I thought I would mention that I've released a couple of performance 
articles on the Ceph blog recently that might be of interest to people:


For sure, thanks a lot, it's really informative!

Can we also ask for special requests? One of the things that would 
help us (and CephFS users in general) is how performance of CephFS for 
small files (~512 bytes, 2k up to say 64K) is impacted by the amount 
of PGs a CephFS metadata pool has.


That's an interesting question.  I wouldn't really expect the metadata 
pool PG count to have a dramatic effect here at counts that result in 
reasonable pseudo-random distribution.  Have you seen otherwise?





Question that might be answered:

- does it help to provision more PGs for workloads that rely heavily 
on OMAP usage by the MDS (or is RocksDB the bottleneck in all cases)?


Tests that might be useful:

- rsync (single threaded, worst case)
- fio random read / write tests with varying io depths and threads
- The CephFS devs might know some performance tests in this context


FWIW I wrote the libcephfs backend code for the IOR and mdtest 
benchmarks used in the IO500 test suite.  Typically I've seen that 
libcephfs and kernel cephfs are competitive with RBD for small random 
writes over a small file set.  It's when you balloon to huge numbers of 
directories/files that CephFS can have problems with the way dirfrags 
are distributed across active MDSes. Directory pinning can help here if 
you have files nicely distributed across lots of directories.  If you 
have a lot of files in a single directory it can become a problem.





One of the tricky things with doing these benchmarks, is that the PG 
placement over the OSDs might heavily impact performance all by 
itself, as primary PGs are not placed in the same way with different 
amount of PGs in the pool. Therefore, ideally, the primaries are 
balanced as evenly possible. I'm eagerly awaiting the Ceph virtual 
2022 talk "New workload balancer in Ceph". Having the primaries 
balanced before these benchmarks run seems to be a prerequisite to do 
a "apples to apples" comparison.


There can be an effect from having poor primary distributions across OSDs,
but typically it's been subtle in my experience at moderately high PG
counts.  The balancer work is certainly interesting though, especially
when you can't have or don't want a lot of PGs.





Gr. Stefan



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to check available storage with EC and different sized OSD's ?

2022-11-09 Thread Paweł Kowalski
If I start to use all available space that pool can offer (4.5T) and the
first OSD (2.7T) fails, I'm sure I'll end up with lost data, since it's
not possible to fit 4.5T on the 2 remaining drives with a total raw
capacity of 3.6T.

I'm wondering why ceph isn't complaining now. I thought it should place
data among the disks in such a way that losing any OSD would keep the data
safe for RO (by wasting the excess 0.9T capacity on the first drive).



Oh, and here's my rule and profile - by mistake I sent it earlier as a PM:


rule ceph3_ec_low_k2_m1-data {
    id 2
    type erasure
    min_size 3
    max_size 3
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class low_hdd
    step choose indep 0 type osd
    step emit
}

crush-device-class=low_hdd
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8
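For what it's worth, this is how I look at what Ceph itself considers usable
(just a sketch; note that MAX AVAIL in ceph df is derived from the most-full
OSD and the full ratio, not from simple raw-capacity arithmetic):

ceph df detail                     # MAX AVAIL per pool, k/m overhead included
ceph osd df tree class low_hdd     # per-OSD fill level and weights for this class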


Paweł


W dniu 8.11.2022 o 15:47, Danny Webb pisze:

with an m value of 1, if you lost a single OSD/failure domain you'd end up with
a read-only pg or cluster.  Usually you need at least k+1 to survive a failure
domain failure, depending on your min_size setting.  The other thing you need to
take into consideration is that the m value is for both failure domain *and*
osd in an unlucky scenario (eg, you had a pg that happened to be on a downed
host and a failed OSD elsewhere in the cluster).  For a 3 OSD configuration
the minimum fault tolerant setup would be k=1, m=2 and you effectively then are
doing replica 3 anyway.  At least this is my understanding of it.  Hope that
helps.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to check available storage with EC and different sized OSD's ?

2022-11-09 Thread Danny Webb
With a 3-OSD pool it's not possible for data to be redistributed on failure of
an OSD.  With a k=2, m=1 value your minimum number of OSDs for distribution's
sake is 3.  If you need the ability to redistribute data on failure you'd need
a 4th OSD.  Your k+m value can't be larger than your number of failure domains,
and if it's set exactly to your number of failure domains you'll never
redistribute data on failure.

From: Paweł Kowalski 
Sent: 09 November 2022 15:14
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: How to check available storage with EC and different 
sized OSD's ?

If I start to use all available space that pool can offer (4.5T) and
first OSD (2.7T) fails, I'm sure I'll end up with lost data since it's
not possible to fit 4.5T on 2 remaining drives with total raw capacity
of 3.6T.

I'm wondering why ceph isn't complaining now. I thought it should place
data among disks in that way, that loosing any OSD would keep data safe
for RO. (by wasting excessive 0.9T capacity on the first drive)


Oh, and here's my rule and profile - by mistake I've sent it on PM:


rule ceph3_ec_low_k2_m1-data {
 id 2
 type erasure
 min_size 3
 max_size 3
 step set_chooseleaf_tries 5
 step set_choose_tries 100
 step take default class low_hdd
 step choose indep 0 type osd
 step emit
}

crush-device-class=low_hdd
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8


Paweł


W dniu 8.11.2022 o 15:47, Danny Webb pisze:
> with a m value of 1 if you lost a single OSD/failure domain you'd end up with 
> a read only pg or cluster.  usually you need at least k+1 to survive a 
> failure domain failure depending on your min_size setting.  The other thing 
> you need to take into consideration is that the m value is for both failure 
> domain *and* osd in an unlucky scenario (eg, you had a pg that happened to be 
> on a downed host and a failed OSD elsewhere in the cluster).For a 3 OSD 
> configuration the minimum fault tolerant setup would be k=1, m=2 and you 
> effectively then are doing replica 3 anyways.  At least this is my 
> understanding of it.  Hope that helps
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Danny Webb
Principal OpenStack Engineer
The Hut Group

Tel:
Email: danny.w...@thehutgroup.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to check available storage with EC and different sized OSD's ?

2022-11-09 Thread Paweł Kowalski
I don't need to redistribute data after an OSD failure. All I want to do in
this test setup is to keep the data safe in RO after such a failure.


Paweł



W dniu 9.11.2022 o 17:09, Danny Webb pisze:

With a 3 osd pool it's not possible for data to be redistributed on failure of 
an OSD.  with a K=2,M=1 value your minimum number of OSDs for distributions 
sake is 3.  If you need the ability to redistribute data on failure you'd need 
a 4th OSD.  You k/m value can't be larger than your failure domain and if it's 
set exactly to your failure domain you'll never redistribute data on failure.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Rook mgr module failing

2022-11-09 Thread Mikhail Sidorov
Hello!
I tried to turn on the rook ceph mgr module like so:

ceph mgr module enable rook
ceph orch set backend rook

But after that I started getting errors:

500 - Internal Server Error
The server encountered an unexpected condition which prevented it from
fulfilling the request.

Mgr logs show this traceback:

debug 2022-11-07T15:39:59.931+ 7f0dae860700  0 [dashboard ERROR
exception] Internal Server Error
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47,
in dashboard_exception_handler
return handler(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/cherrypy/_cpdispatch.py",
line 54, in __call__
return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py",
line 258, in inner
ret = func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/orchestrator.py",
line 33, in _inner
return method(self, *args, **kwargs)
  File "/usr/lib64/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
  File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 506,
in inventory
return get_inventories(None, refresh)
  File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 251,
in get_inventories
for host in orch.inventory.list(hosts=hosts, refresh=do_refresh)]
  File "/usr/share/ceph/mgr/dashboard/services/orchestrator.py", line
38, in inner
raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in
raise_if_exception
raise e
Exception: No storage class exists matching name provided in ceph
config at mgr/rook/storage_class

Is it a bug, or am I missing something?
I am using quay.io/ceph/ceph:v17.2.5.
I tried to use quay.io/ceph/ceph:v17.2.5-20221017 and the services tab on the
dashboard started working, but when I switch to the devices tab it still shows
the same error.
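(Judging purely from the traceback, the module seems to want the name of an
existing Kubernetes StorageClass in the mgr/rook/storage_class config key; my
guess would be something along the lines of the following, with the class name
being a placeholder for whatever StorageClass actually exists in the cluster:

ceph config set mgr mgr/rook/storage_class <existing-storage-class>

but I haven't confirmed that this is the intended fix.)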
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] iscsi target lun error

2022-11-09 Thread Randy Morgan
I am trying to create a second iscsi target and I keep getting an error 
when I create the second target:



   Failed to update target 'iqn.2001-07.com.ceph:1667946365517'

disk create/update failed on host.containers.internal. LUN allocation 
failure


I am running ceph Pacific:
Version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

All of the information I can find on this problem is from 3 years ago 
and doesn't seem to apply any more.  Does anyone know how to correct 
this problem?


Randy

--
Randy Morgan
IT Manager
Department of Chemistry/BioChemistry
Brigham Young University
ran...@chem.byu.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iscsi target lun error

2022-11-09 Thread Xiubo Li


On 10/11/2022 02:21, Randy Morgan wrote:
I am trying to create a second iscsi target and I keep getting an 
error when I create the second target:



   Failed to update target 'iqn.2001-07.com.ceph:1667946365517'

disk create/update failed on host.containers.internal. LUN allocation 
failure


I think you were using cephadm to add the iscsi targets, not gwcli or the
REST APIs directly.


The other issues we hit before were login failures, because there were
two gateways using the same IP address. Please share your `gwcli ls`
output so we can see what the 'host.containers.internal' gateway's config
looks like.


Thanks!



I am running ceph Pacific: *Version*
16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
pacific (stable)

All of the information I can find on this problem is from 3 years ago 
and doesn't seem to apply any more.  Does anyone know how to correct 
this problem?


Randy



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Best practice for removing failing host from cluster?

2022-11-09 Thread Matt Larson
We have a Ceph cluster running Octopus v 15.2.3 , and 1 of 12 of the hosts
in the cluster has started having what appears to be a hardware issue
causing it to freeze.  This began with a freeze and reported 'CATERR' in
the server logs. The host has been having repeated freeze issues over the
last week.

I'm looking to safely isolate this host from the cluster while
troubleshooting.  I started trying to remove OSDs from the host with `ceph
orch osd rm XX` for one of the drives on the node to rebalance the data
from the host.  The host is now having difficulties remaining online for
extended periods of time, and so I was planning to remove this host from
the cluster / to remove all the remaining OSDs from the node.  What would
be the best way to do this?

Should I use `ceph orch osd rm XX` for each of the OSDs of this host,
or should I set the weights of each of the OSDs to 0?  Can I do this while
the host is offline, or should I bring it online first before setting
weights or using `ceph orch osd rm`?

Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS constant high write I/O to the metadata pool

2022-11-09 Thread Venky Shankar
Hi Olli,

On Mon, Oct 17, 2022 at 1:08 PM Olli Rajala  wrote:
>
> Hi Patrick,
>
> With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops" and/or
> "ceph tell mds.pve-core-1 objecter_requests"? Both these show very few
> requests/ops - many times just returning empty lists. I'm pretty sure that
> this I/O isn't generated by any clients - I've earlier tried to isolate
> this by shutting down all cephfs clients and this didn't have any
> noticeable effect.
>
> I tried to watch what is going on with that "perf dump" but to be honest
> all I can see is some numbers going up in the different sections :)
> ...don't have a clue what to focus on and how to interpret that.
>
> Here's a perf dump if you or anyone could make something out of that:
> https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673

You'd need to capture this over a period of time to see what ops might
be going through and what the mds is doing.
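Something as simple as this is enough (the mds name, interval and output
path are placeholders):

while true; do
    ceph tell mds.pve-core-1 perf dump > /tmp/mds-perf-$(date +%s).json
    sleep 10
done

Diffing the counters between consecutive snapshots then shows which ops and
journal/metadata counters are actually moving while the clients are idle.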

>
> Tnx!
> o.
>
> ---
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---
>
>
> On Fri, Oct 14, 2022 at 8:32 PM Patrick Donnelly 
> wrote:
>
> > Hello Olli,
> >
> > On Thu, Oct 13, 2022 at 5:01 AM Olli Rajala  wrote:
> > >
> > > Hi,
> > >
> > > I'm seeing constant 25-50MB/s writes to the metadata pool even when all
> > > clients and the cluster is idling and in clean state. This surely can't
> > be
> > > normal?
> > >
> > > There's no apparent issues with the performance of the cluster but this
> > > write rate seems excessive and I don't know where to look for the
> > culprit.
> > >
> > > The setup is Ceph 16.2.9 running in hyperconverged 3 node core cluster
> > and
> > > 6 hdd osd nodes.
> > >
> > > Here's typical status when pretty much all clients are idling. Most of
> > that
> > > write bandwidth and maybe fifth of the write iops is hitting the
> > > metadata pool.
> > >
> > >
> > ---
> > > root@pve-core-1:~# ceph -s
> > >   cluster:
> > > id: 2088b4b1-8de1-44d4-956e-aa3d3afff77f
> > > health: HEALTH_OK
> > >
> > >   services:
> > > mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w)
> > > mgr: pve-core-1(active, since 4w), standbys: pve-core-2, pve-core-3
> > > mds: 1/1 daemons up, 2 standby
> > > osd: 48 osds: 48 up (since 5h), 48 in (since 4M)
> > >
> > >   data:
> > > volumes: 1/1 healthy
> > > pools:   10 pools, 625 pgs
> > > objects: 70.06M objects, 46 TiB
> > > usage:   95 TiB used, 182 TiB / 278 TiB avail
> > > pgs: 625 active+clean
> > >
> > >   io:
> > > client:   45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
> > >
> > ---
> > >
> > > Here's some daemonperf dump:
> > >
> > >
> > ---
> > > root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
> > >
> > mds-
> > > --mds_cache--- --mds_log-- -mds_mem- ---mds_server---
> > mds_
> > > -objecter-- purg
> > > req  rlat fwd  inos caps exi  imi  hifc crev cgra ctru cfsa cfa  hcc
> > hccd
> > > hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr  hcs  hsr  cre
> > >  cat |sess|actv rd   wr   rdwr|purg|
> > >  4000  767k  78k   0001610055
> > >  37 |1.1k   00 | 17  3.7k 1340 |767k 767k| 40500
> > >  0 |110 |  42   210 |  2
> > >  5720  767k  78k   0003   16300   11   11
> > >  0   17 |1.1k   00 | 45  3.7k 1370 |767k 767k| 57800
> > >  0 |110 |  02   280 |  4
> > >  5740  767k  78k   0004   34400   34   33
> > >  2   26 |1.0k   00 |134  3.9k 1390 |767k 767k| 57   1300
> > >  0 |110 |  02  1120 | 19
> > >  6730  767k  78k   0006   32600   22   22
> > >  0   32 |1.1k   00 | 78  3.9k 1410 |767k 768k| 67400
> > >  0 |110 |  02   560 |  2
> > >
> > ---
> > > Any ideas where to look at?
> >
> > Check the perf dump output of the mds:
> >
> > ceph tell mds.:0 perf dump
> >
> > over a period of time to identify what's going on. You can also look
> > at the objecter_ops (another tell command) for the MDS.
> >
> > --
> > Patrick Donnelly, Ph.D.
> > He / Him / His
> > Principal Software Engineer
> > Red Hat, Inc.
> > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Cheers,
Venky

__

[ceph-users] Re: Question regarding Quincy mclock scheduler.

2022-11-09 Thread philippe

Hi,
Thanks a lot for the clarification; we will adapt our setup
using a custom profile with the proposed parameters.

Kr
Philippe

On 9/11/22 14:04, Aishwarya Mathuria wrote:

Hello Philippe,

Your understanding is correct, 50% of IOPS are reserved for client 
operations.

osd_mclock_max_capacity_iops_hdd defines the capacity per OSD.

There is a mClock queue for each OSD shard. The number of shards are
defined by osd_op_num_shards_hdd, which by default is set to 5.
So each queue has osd_mclock_max_capacity_iops_hdd/osd_op_num_shards_hdd 
IOPS.
In your case, this would mean that the capacity of each mClock queue is 
equal to 4578 IOPS (22889/5)

This would make osd_mclock_scheduler_client_res = 2289 (50%)

I hope that explains the numbers you are seeing.

We also have a few optimizations coming with regards to
the mClock profiles where we are increasing the reservation for client 
operations in the high client profile. This change will also address the 
inflated osd capacity numbers that are encountered in some cases with 
osd bench.


Let me know if you have any other questions.

Regards,
Aishwarya

On Wed, Nov 9, 2022 at 1:46 PM philippe > wrote:


Hi,
We have a quincy 17.2.5 based cluster, and we have some question
regarding the mclock iops scheduler.
Looking into the documentation, the default profile is the
HIGH_CLIENT_OPS
that mean that 50% of IOPS for an OSD are reserved for clients
operations.

But looking into OSD configuration settings, it seems that this is not
the case or probably there is something I don't understand.

ceph config get osd.0 osd_mclock_profile
high_client_ops



ceph config show osd.0 | grep mclock
osd_mclock_max_capacity_iops_hdd                  22889.222997   mon
osd_mclock_scheduler_background_best_effort_lim   99             default
osd_mclock_scheduler_background_best_effort_res   1144           default
osd_mclock_scheduler_background_best_effort_wgt   2              default
osd_mclock_scheduler_background_recovery_lim      4578           default
osd_mclock_scheduler_background_recovery_res      1144           default
osd_mclock_scheduler_background_recovery_wgt      1              default
osd_mclock_scheduler_client_lim                   99             default
osd_mclock_scheduler_client_res                   2289           default
osd_mclock_scheduler_client_wgt                   2              default


So i have osd_mclock_max_capacity_iops_hdd = 22889.222997 why
osd_mclock_scheduler_client_res is not 11444 ?

this value seem strange to me.

Kr
Philippe
___
ceph-users mailing list -- ceph-users@ceph.io

To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice for removing failing host from cluster?

2022-11-09 Thread Robert Sander

On 10.11.22 03:19, Matt Larson wrote:


Should I use `ceph orch osd rm XX` for each of the OSDs of this host
or should I set the weights of each of the OSDs as 0?  Can I do this while
the host is offline, or should I bring it online first before setting
weights or using `ceph orch osd rm`?


I would set all OSDs of this host to "out" first.
This way the cluster still knows about them and is able to utilize them 
when doing the data movement to the other OSDs.


After they are really empty you can purge them and remove the host from 
the cluster.
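Roughly, something along these lines (OSD ids and the host name are
placeholders for whatever actually lives on that host):

ceph osd out 10 11 12                      # mark all OSDs of the host out
ceph osd safe-to-destroy osd.10            # check per OSD once backfill is done
ceph osd purge 10 --yes-i-really-mean-it   # then purge each OSD
ceph orch host rm <failing-host>           # finally remove the host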


Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io