Hi Burkhard,
Thanks for your answer. Your explanation seems to match our
observations well, in particular the fact that new misplaced objects
appear when we fall below something like 0.5% misplaced objects. What
is still not clear to me is that 'ceph osd pool ls detail' for the
modified pool does not report the new pg_num target (2048) but the old
one (256):
pool 62 'ias-z1.rgw.buckets.data' erasure profile k9_m6_host size 15
min_size 10 crush_rule 3 object_hash rjenkins pg_num 323 pgp_num 307
pg_num_target 256 pgp_num_target 256 autoscale_mode off last_change
439681 lfor 0/439680/439678 flags hashpspool,bulk max_bytes
200000000000000 stripe_width 36864 application rgw
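For what it's worth, the same values can also be checked in
machine-readable form with something like the following (jq is assumed
to be available, and the exact field names may differ between
releases), looking at the pg_num_target / pgp_num target fields in the
JSON output:
  ceph osd pool ls detail -f json-pretty | \
    jq '.[] | select(.pool_name == "ias-z1.rgw.buckets.data")'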
- Is this caused by the fact that the autoscaler was still on when I
increased the number of PGs, and that I only disabled it on the pool
~12h after entering the command to extend it?
- Or was it a mistake on my part to extend only pg_num and not
pgp_num? According to the documentation, which I just read again, both
should be extended at the same time, otherwise the result is not what
is expected. If that is the case, should I simply re-enter the commands
to extend pg_num and pgp_num, as sketched below, and wait for the
resulting remapping?
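If re-entering them is the right fix, I assume the commands would be
something like the following (please correct me if this is not the
proper way to do it):
  ceph osd pool set ias-z1.rgw.buckets.data pg_num 2048
  ceph osd pool set ias-z1.rgw.buckets.data pgp_num 2048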
Best regards,
Michel
On 01/04/2025 at 09:32, Burkhard Linke wrote:
Hi,
On 4/1/25 09:06, Michel Jouvin wrote:
Hi,
We are observing a new strange behaviour on our production cluster:
we increased the number of PGs (from 256 to 2048) in an EC pool after
a warning that there was a very high number of objects per PG (the
pool has 52M objects).
Background: this is happening in the cluster that had a strange problem
last week, discussed in the thread "Production cluster in bad shape
after several OSD crashes". The PG increase was done after the
cluster returned to a normal state.
The increase in the number of PGs resulted in 20% misplaced objects
and ~160 PGs remapped (out of 256). As there is not much user activity
on this cluster (except on this pool) these days, we decided to set the
mClock profile to high_recovery_ops. We also disabled the autoscaler on
this pool (it was enabled, and it is not clear why we got the warning
with the autoscaler enabled). The pool was created with --bulk. The
commands used were along the lines of the ones below.
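For completeness, something like (pool name as above):
  ceph config set osd osd_mclock_profile high_recovery_ops
  ceph osd pool set ias-z1.rgw.buckets.data pg_autoscale_mode off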
The remapping proceeded steadily for 2-3 days (as far as we can tell),
but when reaching the end (between 0.5% and 1% misplaced objects,
~10 PGs remapped) new remapped PGs are added (they can be seen with
'ceph pg dump_stuck'), all belonging to the pool affected by the
increase. This has already happened 3-4 times and it is very unclear
why. No specific problem was reported on the cluster that could explain
it (no OSD down). I was wondering whether the balancer may be
responsible for this, but I don't have the feeling that it is the case:
first, the balancer doesn't report doing anything (but I may be missing
the history); second, the balancer would probably affect PGs from
different pools (there are 50 pools in the cluster). There are 2
warnings that may or may not be related:
*snipsnap*
you are probably seeing the balancer in action. If you increase
the number of PGs, this change is performed gradually. You can check
this by having a look at the output of 'ceph osd pool ls detail'. It
will print the current number of PGs _and_ PGPs, and the target number
for both values. If you increase the number of PGs, existing PGs have
to be split and part of the data has to be transferred to the new
PGs. The balancer takes care of this and starts the next PG split once
the number of misplaced objects falls below a certain threshold. The
default value is 0.5%, if I remember correctly; the option to check is
sketched below.
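If I remember correctly, the relevant option in recent releases is
target_max_misplaced_ratio, and you can check its value on your cluster
with something like:
  ceph config get mgr target_max_misplaced_ratio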
- All mons have their data size > 15 GB (mon_data_size_warn),
currently 16 GB. It happened progressively during the remapping on
all 3 mons; I guess it is due to the operation in progress and is
harmless. Do you confirm?
The mons have to keep historic OSD and PG maps as long as the cluster
is not healthy. During large data movement operations this can pile
up to a significant size. You should monitor the available space and
ensure that the mons are not running out of disk space. I'm not sure
whether manual intervention such as a manual compaction will help in
this case.
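If you want to try it anyway, a compaction can be triggered per monitor
with something like the following (replace <id> with the mon id; the
store path below is the default for package-based installs and will
differ on cephadm deployments):
  ceph tell mon.<id> compact
  du -sh /var/lib/ceph/mon/ceph-<id>/store.db   # store size before/after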
- Since yesterday we have 4 PGs that have not been deep-scrubbed in
time, belonging to different pools. Again, I tend to attribute this
to the remapping in progress putting too much load (or other
constraints) on the cluster, as there are a lot of deep scrubs per day.
The current age distribution of deep scrubs (number of PGs per last
deep-scrub date; one way to produce such a distribution is sketched
after the list) is:
4 "2025-03-19
21 "2025-03-20
46 "2025-03-21
35 "2025-03-22
81 "2025-03-23
597 "2025-03-24
1446 "2025-03-25
2234 "2025-03-26
2256 "2025-03-27
1625 "2025-03-28
1980 "2025-03-29
2993 "2025-03-30
3871 "2025-03-31
1113 "2025-04-01
Should we worry about the situation? If yes, what would you advise us
to look at or to do? To clear the problem last week, we had to restart
all OSDs but we didn't restart the mons. Do they play a role in
deciding the remapping plan? Is restarting them something that may help?
Remapping PGs has a higher priority than scrubbing, so as long as the
current pool extension is not finished, only idle OSDs will scrub
their PGs. This is expected, and the cluster will take care of the
missing scrub runs after it is healthy again.
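In the meantime, if you want to keep an eye on which PGs are lagging,
they are listed in the PG_NOT_DEEP_SCRUBBED section of
'ceph health detail', e.g.:
  ceph health detail | grep -A 10 PG_NOT_DEEP_SCRUBBED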
Best regards,
Burkhard Linke
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io