Hi,
On 4/1/25 09:06, Michel Jouvin wrote:
Hi,
We are observing a new strange behaviour on our production cluster: we
increased the number of PGs (from 256 to 2048) in an EC pool after a
warning that there was a very high number of objects per PG (the pool
has 52M objects).
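(For reference, such an increase is typically applied with a command
along these lines; the pool name is just a placeholder:)

    # raise the PG count of the pool; on recent releases the mgr then
    # increases pg_num/pgp_num gradually rather than all at once
    ceph osd pool set <pool-name> pg_num 2048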
Background: this happens in the cluster that had a strange problem
last week, discussed in the thread "Production cluster in bad shape
after several OSD crashes". The PG increase was done after the cluster
returned to a normal state.
The increase of the number of PGs resulted in 20% misplaced objects and
~160 PGs remapped (out of 256). As there is not much user activity on
this cluster (except on this pool) these days, we decided to set the
mclock profile to high_recovery_ops. We also disabled the autoscaler on
this pool (it was enabled, and it is not clear why we got the warning
with the autoscaler enabled). The pool was created with --bulk.
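(For the record, these settings correspond to commands along these
lines; the pool name is a placeholder:)

    # favour recovery/backfill over client I/O in the mclock scheduler
    ceph config set osd osd_mclock_profile high_recovery_ops
    # disable the PG autoscaler for this pool only
    ceph osd pool set <pool-name> pg_autoscale_mode off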
The remapping proceeded steadily for 2-3 days (as far as we can tell),
but when it gets close to the end (between 0.5% and 1% of misplaced
objects, ~10 PGs remapped), new remapped PGs appear (they can be seen
with 'ceph pg dump_stuck'), all belonging to the pool affected by the
increase. This has already happened 3-4 times and it is very unclear
why. There was no specific problem reported on the cluster that would
explain it (no OSD down). I was wondering if the balancer might be
responsible, but I don't have the feeling that it is: first, the
balancer does not report doing anything (but I may have missed the
history); second, the balancer would probably affect PGs from different
pools (there are 50 pools in the cluster). There are 2 warnings that
may or may not be related:
*snipsnap*
you are probably seeing the balancer in action. If you increase the
number of PGs, this change is performed gradually. You can check this by
having a look at the output of 'ceph osd pool ls detail'. It will print
the current number of PGs _and_ PGPs, and the target number of both
values. If you increase the number of PGs, existing PGs have to be
split and a part of the data has to be transferred to the new PGs. The
balancer will take care of this and start the next PG split once the
number of misplaced objects falls below a certain threshold. The default
value is 0.5% if I remember correctly.
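If you want to check the actual threshold and the state of the pool,
something like the following should do (the relevant option is
target_max_misplaced_ratio if I'm not mistaken; the pool name is a
placeholder):

    # pg_num/pgp_num lag behind their *_target values while PGs are split
    ceph osd pool ls detail | grep <pool-name>
    # misplaced-objects threshold that throttles the gradual increase
    ceph config get mgr target_max_misplaced_ratio
    # PGs currently stuck unclean; PG IDs are prefixed with the pool ID
    ceph pg dump_stuck unclean
    # whether the balancer module has been doing anything at all
    ceph balancer status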
- All mons have their data size > 15 GB (mon_data_size_warn), currently
16 GB. This happened progressively during the remapping on all 3 mons;
I guess it is due to the operation in progress and is harmless. Can you
confirm?
The mons have to keep historic OSD and PG maps as long as the cluster is
not healthy. During large data movement operations this can pile up to
a significant size. You should monitor the available space and ensure
that the mons are not running out of disk space. I'm not sure whether
manual intervention like a manual compaction will help in this case.
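A quick way to keep an eye on this (the mon store path below assumes a
default, non-containerized layout and may differ on your setup):

    # size of the mon store on each monitor host
    du -sh /var/lib/ceph/mon/ceph-<mon-id>/store.db
    # manual compaction of a monitor's store, if you decide to try it
    ceph tell mon.<mon-id> compact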
- Since yesterday we have 4 PGs that have not been deep scrubbed in
time, belonging to different pools. Again, I tend to correlate this
with the remapping in progress putting too much load (or other
constraints) on the cluster, as there are a lot of deep scrubs per day.
The current age distribution of deep scrubs is (see the sketch after
the list):
4 "2025-03-19
21 "2025-03-20
46 "2025-03-21
35 "2025-03-22
81 "2025-03-23
597 "2025-03-24
1446 "2025-03-25
2234 "2025-03-26
2256 "2025-03-27
1625 "2025-03-28
1980 "2025-03-29
2993 "2025-03-30
3871 "2025-03-31
1113 "2025-04-01
Should we worry about the situation? If yes, what would you advise us
to look at or do? To clear the problem last week we had to restart all
OSDs, but we didn't restart the mons. Do they play a role in deciding
the remapping plan? Could restarting them help?
Remapping PGs has a higher priority than scrubbing, so as long as the
current pool extension is not finished, only idle OSDs will scrub their
PGs. This is expected, and the cluster will catch up on the missing
scrub runs once it is healthy again.
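If you want to verify that behaviour on your cluster, the relevant knob
is, as far as I know, osd_scrub_during_recovery (scrubbing during
recovery/backfill is disabled by default):

    # whether OSDs are allowed to scrub while recovery/backfill is ongoing
    ceph config get osd osd_scrub_during_recovery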
Best regards,
Burkhard Linke