Hi Burkhard,

Thanks for your answer. Your explanation seems to match our observations well, in particular the fact that new misplaced objects are added when we fall below something like 0.5% of misplaced objects. What is still not clear to me is that 'ceph osd pool ls detail' for the modified pool does not report the new pg_num target (2048) but the old one (256):

pool 62 'ias-z1.rgw.buckets.data' erasure profile k9_m6_host size 15 min_size 10 crush_rule 3 object_hash rjenkins pg_num 323 pgp_num 307 pg_num_target 256 pgp_num_target 256 autoscale_mode off last_change 439681 lfor 0/439680/439678 flags hashpspool,bulk max_bytes 200000000000000 stripe_width 36864 application rgw

- Is this caused by the fact that the autoscaler was still on when I increased the number of PGs, and that I disabled it on the pool only ~12h after entering the command to extend it?

- Or was it a mistake on my part to extend only pg_num and not pgp_num? According to the doc that I just read again, both should be extended at the same time, otherwise it does not produce the expected result. If that is the case, should I just re-enter the commands to extend pg_num and pgp_num (and wait for the resulting remapping!)?
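For reference, if re-issuing the commands turns out to be the right move, I assume it would look something like the following (using our pool name; please correct me if setting pgp_num explicitly is unnecessary on recent releases):

  # Raise the PG count target; the mgr then splits the PGs gradually
  ceph osd pool set ias-z1.rgw.buckets.data pg_num 2048

  # Raise the placement target as well so the data is actually remapped
  ceph osd pool set ias-z1.rgw.buckets.data pgp_num 2048

  # Check that pg_num_target / pgp_num_target now report 2048
  ceph osd pool ls detail | grep ias-z1.rgw.buckets.data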

Best regards,

Michel


On 01/04/2025 at 09:32, Burkhard Linke wrote:
Hi,

On 4/1/25 09:06, Michel Jouvin wrote:
Hi,

We are observing a new strange behaviour on our production cluster: we increased the number of PGs (from 256 to 2048) in an (EC) pool after a warning that there was a very high number of objects per pool (the pool has 52M objects).

Background: this happens in the cluster that had a strange problem last week, discussed in the thread "Production cluster in bad shape after several OSD crashes". The PG increase was done after the cluster returned to a normal state.

The increase in the number of PGs resulted in 20% misplaced objects and ~160 PGs remapped (out of 256). As there is not much user activity on this cluster (except on this pool) these days, we decided to set the mclock profile to high_recovery_ops. We also disabled the autoscaler on this pool (it was enabled, and it is not clear why we got the warning with the autoscaler enabled). The pool was created with --bulk.
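For completeness, the commands we used were roughly the following (reconstructed from memory, so the exact invocations may have differed slightly):

  # Prioritise recovery/backfill over client I/O with the mClock scheduler
  ceph config set osd osd_mclock_profile high_recovery_ops

  # Stop the autoscaler from touching this pool again
  ceph osd pool set ias-z1.rgw.buckets.data pg_autoscale_mode off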

The remapping proceeded steadily for 2-3 days (as far as we can tell), but when reaching the end (between 0.5% and 1% of misplaced objects, ~10 PGs remapped) new remapped PGs are added (which can be seen with 'ceph pg dump_stuck'), all belonging to the pool affected by the increase. This has already happened 3-4 times and it is very unclear why. There was no specific problem reported on the cluster that may explain this (no OSD down). I was wondering if the balancer may be responsible for this, but I don't have the feeling it is the case: first, the balancer doesn't report doing anything (but I may have missed the history); second, the balancer would probably affect PGs from different pools (there are 50 pools in the cluster). There are 2 warnings that may or may not be related:

*snipsnap*

you are probably seeing the load balancer in action. If you increase the number of PGs, this change is performed gradually. You can check this by having a look at the output of 'ceph osd pool ls detail'. It will print the current number of PGs _and_ PGPs, and the target number of both values. If you increase the number of PGs, existing PGs have to be split and a part of the data has to be transferred to the new PGs. The load balancer will take care of this and start the next PG split when the number of misplaced objects falls below a certain threshold. The default value is 0.5% if I remember correctly.
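You can check both things from the command line, something along these lines (I'm quoting the option name from memory, so please double-check it for your release):

  # Current vs. target PG/PGP numbers for the pool
  ceph osd pool ls detail | grep ias-z1.rgw.buckets.data

  # Threshold of misplaced objects below which the mgr starts the next batch of splits
  ceph config get mgr target_max_misplaced_ratio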


- All mons have their data size > 15 GB (mon_data_size_warn), currently 16 GB. It happened progressively during the remapping on all 3 mons; I guess it is due to the operation in progress and is harmless. Can you confirm?

The mons have to keep historic osd and pg maps as long as the cluster is not healthy. During large data movement operations this can pile up to a significant size. You should monitor the available space and ensure that the mons are not running out of disk space. I'm not sure whether a manual intervention like a manual compaction would help in this case.
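A rough way to keep an eye on it (the store path depends on how the mons were deployed, so treat this as a sketch):

  # Size of the mon store on each monitor host (adjust the path for cephadm deployments)
  du -sh /var/lib/ceph/mon/*/store.db

  # Manual compaction of a single mon, should you decide to try it
  ceph tell mon.<id> compact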

- Since yesterday we have 4 PGs that have not been deep-scrubbed in time, belonging to different pools. Again, I tend to attribute this to the remapping in progress putting too much load (or other constraints) on the cluster, as there are a lot of deep scrubs per day. The current age distribution of deep scrubs is as follows (a possible extraction command is sketched after the list):

      4 "2025-03-19
     21 "2025-03-20
     46 "2025-03-21
     35 "2025-03-22
     81 "2025-03-23
    597 "2025-03-24
   1446 "2025-03-25
   2234 "2025-03-26
   2256 "2025-03-27
   1625 "2025-03-28
   1980 "2025-03-29
   2993 "2025-03-30
   3871 "2025-03-31
   1113 "2025-04-01

Should we worry about the situation? If yes, what would you advise us to look at or do? To clear the problem last week, we had to restart all OSDs, but we didn't restart the mons. Do they play a role in deciding the remapping plan? Is restarting them something that might help?

Remapping PGs has a higher priority than scrubbing. So as long as the current pool extension is not finished, only idle OSDs will scrub their PGs. This is expected, and the cluster will take care of the missing scrub runs after it is healthy again.
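If you want to see which PGs are currently flagged, 'ceph health detail' lists them under the corresponding warning, e.g.:

  # List the PGs flagged as not deep-scrubbed in time
  ceph health detail | grep -A 10 PG_NOT_DEEP_SCRUBBED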


Best regards,

Burkhard Linke

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io