I recently got mclock going literally an order of magnitude faster.  I would 
love to claim I found all the options myself, but I learned which knobs I 
needed to turn from posts here.

Steps I took:
- Cleared all OSD-specific osd_mclock_max_capacity_iops settings.  The 
auto-created ones were all over the place.  Some rust drives claimed 200 and 
others well over 5000.

- Set sane global osd_mclock_max_capacity_iops_hdd and 
osd_mclock_max_capacity_iops_ssd numbers, based on the average performance of 
the slowest drives in my environment (your numbers will be different; these 
are for 18T SAS Seagate rust drives and Micron 9100 6.4T NVMe).
     - osd                           basic     osd_mclock_max_capacity_iops_hdd 
     - osd                           basic     osd_mclock_max_capacity_iops_ssd 

- Set the profile to what I wanted my global default to be.
     - osd                           advanced  osd_mclock_profile               

- Tweaked the costs of doing operations.
     - osd                           dev       osd_mclock_cost_per_byte_usec_hdd               1.000000
     - osd                           dev       osd_mclock_cost_per_byte_usec_ssd               0.005000
I need to revisit the cost per byte settings.  Originally I was using just this 
knob to play with speeds, but I quickly started getting many slow ops along 
with the faster speeds.  Then I pulled the max capacity iops down from 400 and 
finally settled where I am now.  I have room for improvement here but this is 
my prod cluster so.. yeah.

- Next I set specific faster drives to their own max capacity iops 
(the Optane drives I have for the metadata tier).
     - e.g.   osd.450                       basic     osd_mclock_max_capacity_iops_ssd                785000.000000

- I also set the profile to "balanced" on specific drives in a tier I'm 
migrating to new spinners, to speed that migration up.
     - e.g.    osd.789                       advanced  osd_mclock_profile       
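
A minimal sketch of how the first step above could be scripted, assuming the whitespace-separated `ceph config dump` layout shown in the rows above. The function name gen_rm_overrides is hypothetical, and the script only prints `ceph config rm` commands so they can be reviewed before anything is actually run.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): reads `ceph config dump` text on stdin
# and prints, without executing, one `ceph config rm` command for every
# OSD-specific osd_mclock_max_capacity_iops_hdd/_ssd override it finds.
# Global rows (who == "osd") and other options are left alone.
gen_rm_overrides() {
    awk '$1 ~ /^osd\.[0-9]+$/ {
        # Scan the remaining columns so an optional MASK column does not matter.
        for (i = 2; i <= NF; i++)
            if ($i ~ /^osd_mclock_max_capacity_iops_(hdd|ssd)$/)
                print "ceph config rm " $1 " " $i
    }'
}
```

Running `ceph config dump | gen_rm_overrides` would give a list to look over; piping that list into `sh` would apply it. The global and per-OSD values then go in with `ceph config set osd <option> <value>` and `ceph config set osd.450 <option> <value>` respectively, using numbers measured for your own drives.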

I think that's about it.  I was not scientific AT ALL with this.   I just kept 
turning knobs a little and watching the recovery throughput and healthometer.  
On my cold EC tier rebalance I went from something like 150MB/s 20 obj/s to 
2.1GB/s 750 obj/s.  I know I'm pushing these drives pretty hard because I'm 
watching different drives claim 0 slow ops for N seconds, then a few min later 
clear.  My replicated tier now recovers ridiculously fast as well.

I'm looking forward to pulling all of this out and having ceph 
DoTheRightThing(tm) with recovery speeds.  We shall see.


Paul Mezzanini
Platform Engineer III
Research Computing

Rochester Institute of Technology

 “End users is a description, not a goal.”

From: Dan van der Ster <dan.vanders...@clyso.com>
Sent: Thursday, July 6, 2023 6:04 PM
To: Jesper Krogh
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Cannot get backfill speed up

Hi Jesper,

Indeed many users reported slow backfilling and recovery with the mclock
scheduler. This is supposed to be fixed in the latest quincy but clearly
something is still slowing things down.
Some clusters have better luck reverting to osd_op_queue = wpq.

(I'm hoping by proposing this someone who tuned mclock recently will chime
in with better advice).
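
A rough sketch of what the wpq fallback mentioned above might look like. It assumes a cephadm-managed cluster (hence `ceph orch daemon restart`); the function name gen_wpq_switch and the OSD ids passed to it are placeholders. A change to osd_op_queue only takes effect after the OSD restarts, so the sketch prints a restart per OSD, and it prints rather than executes so the commands can be reviewed first.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints the commands for switching the
# OSD op queue to wpq and restarting the given OSD ids one by one.
# Assumption: OSDs are managed by cephadm; other deployments restart OSDs
# differently.
gen_wpq_switch() {
    echo "ceph config set osd osd_op_queue wpq"
    for id in "$@"; do
        echo "ceph orch daemon restart osd.$id"
    done
}
```

For example, `gen_wpq_switch 0 1 2` prints the config change followed by three restart commands.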

Cheers, Dan

Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com

On Wed, Jul 5, 2023 at 10:28 PM Jesper Krogh <jes...@krogh.cc> wrote:

> Hi.
> Fresh cluster - but despite setting:
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd
> osd_recovery_max_active_ssd    50     mon    default[20]
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills
> osd_max_backfills              100    mon    default[10]
> I still get
> jskr@dkcphhpcmgt028:/$ sudo ceph status
>    cluster:
>      id:     5c384430-da91-11ed-af9c-c780a5227aff
>      health: HEALTH_OK
>    services:
>      mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 (age 16h)
>      mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys: dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
>      mds: 2/2 daemons up, 1 standby
>      osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs
>    data:
>      volumes: 2/2 healthy
>      pools:   9 pools, 495 pgs
>      objects: 24.85M objects, 60 TiB
>      usage:   117 TiB used, 159 TiB / 276 TiB avail
>      pgs:     10655690/145764002 objects misplaced (7.310%)
>               474 active+clean
>               15  active+remapped+backfilling
>               6   active+remapped+backfill_wait
>    io:
>      client:   0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr
>      recovery: 328 MiB/s, 108 objects/s
>    progress:
>      Global Recovery Event (9h)
>        [==========================..] (remaining: 25m)
> With these numbers for the settings, I would expect to get more than 15
> active backfilling... (and based on SSDs and a 2x25gbit network, I can
> also spend more resources on recovery than 328 MiB/s).
> Thanks,
> --
> Jesper Krogh
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
