> 
> In order to get our PG sizes better aligned we doubled the number of PGs on 
> the pool with the largest PG size. The pool is HDD with DB/WAL on SATA SSD, 
> HDD sizes between 2TB and 20TB, and the PG size was ~140GB before the doubling.


Please send `ceph osd dump | grep pool`

> 
>    osd: 576 osds: 576 up (since 2h), 576 in (since 3d); 8767 remapped pgs
> 
>    pools:   18 pools, 25249 pgs
>    objects: 683.85M objects, 1.6 PiB
>    usage:   2.7 PiB used, 1.9 PiB / 4.5 PiB avail
>    pgs:     842769842/3951610673 objects misplaced (21.327%)
>             16481 active+clean
>             8762  active+remapped+backfill_wait
>             6     active+remapped+backfilling

Are you *sure* that you have both the mclock override enabled and the op 
scheduler set to wpq at the proper scope?
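
A quick way to verify what the running daemons actually use, and to set it 
cluster-wide if needed (a sketch; adjust to however you normally manage config, 
and note that osd_op_queue only takes effect after an OSD restart):

   ceph config show osd.0 osd_op_queue
   ceph config show osd.0 osd_mclock_override_recovery_settings
   ceph config set osd osd_op_queue wpq
   ceph config set osd osd_mclock_override_recovery_settings true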

Note that if you’re using a wide EC profile, that will gridlock the process to 
an extent, since each backfilling PG reserves slots on many OSDs at once.

> 
>  io:
>    client:   374 MiB/s rd, 14 MiB/s wr, 2.86k op/s rd, 410 op/s wr
>    recovery: 153 MiB/s, 38 objects/s
> "
> 
> The balancer was running and seemingly making very small changes:
> 
> "
> [root@lazy ~]# ceph balancer status
> {
>    "active": true,
>    "last_optimize_duration": "0:00:01.012679",
>    "last_optimize_started": "Mon Apr 28 10:01:24 2025",
>    "mode": "upmap",
>    "no_optimization_needed": true,
>    "optimize_result": "Optimization plan created successfully",
>    "plans": []
> }
> "

The balancer has a misplaced % threshold above which it won’t make additional 
changes; I think it defaults to 5%.  With 21% misplaced the balancer will be 
on hold.
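
If you want to check or raise that threshold, I believe the knob is the mgr 
option target_max_misplaced_ratio (expressed as a ratio, so 5% = .05):

   ceph config get mgr target_max_misplaced_ratio
   ceph config set mgr target_max_misplaced_ratio .07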

> 
> 
> This is going to take a while, any tips on how to escape the apparent 
> bottleneck?

Try raising 

osd_recovery_max_active
osd_recovery_max_single_start
osd_max_backfills

to 2 or even 3.  I have no empirical evidence, but I’ve observed that when 
changing back to wpq, somewhat higher than customary values for these may be 
needed to be effective.  Restarting the OSDs one failure domain at a time, 
waiting for recovery in between, might help according to some references.
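
For example (a sketch, assuming the central config db; with wpq these apply 
directly, with mclock you would also need the override flag mentioned above):

   ceph config set osd osd_max_backfills 3
   ceph config set osd osd_recovery_max_active 3
   ceph config set osd osd_recovery_max_single_start 2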

> 
> Is having many PGs misplaced actually counter productive

Not so much unless you’re severely low on RAM, I think, but I would suggest 
upmap-remapped to clear the misplaced PGs and let the balancer move them 
incrementally.  If you have 21% misplaced, pgremapper may not have worked as 
expected - I have never used it, but upmap-remapped has worked well for me, 
usually needing 2-3 successive runs.
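
If you have not run it before, the basic pattern (a sketch of how I understand 
the CERN ceph-scripts tool; review the generated commands before applying them):

   ./upmap-remapped.py        # prints the pg-upmap-items commands, applies nothing
   ./upmap-remapped.py | sh   # apply them; re-run until misplaced stops dropping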


> I was thinking it was better to let the balancer balance all it could, as 
> that would make all the moves available and decrease the risk of 
> bottlenecking.

Wise choice.

> 
> Thanks.
> 
> Mvh.
> 
> Torkil
> 
> -- 
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
