Frédéric,
Thanks for your answer. I checked the number of PGs on osd.17: it is 164,
very far from the hard limit (750, the default I think). So it doesn't
seem to be the problem, and maybe the peering is just a victim of the more
general problem that makes many pools more or less inaccessible.
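For reference, we counted the PGs with something like the following (just
one way to do it; the awk filter on the OSD id is an example):

  # PGS column of 'ceph osd df' = number of PGs hosted by each OSD
  ceph osd df | awk 'NR == 1 || $1 == 17'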
What inaccessible means here is not entirely clear:
- We tested the ability to access the pool contents with 'rados ls' as I
said, and we considered a pool inaccessible when the command timed out
after 10s (no explicit error). This also happens on empty pools (the
exact check we used is sketched after this list).
- At the same time, on at least one such pool, we were able to
successfully upload and download a large file with an S3 client (this
pool is one of the data pools of a Swift RGW).
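The check itself was along these lines (a minimal sketch; the 10s cutoff
is just what we happened to use):

  for pool in $(ceph osd pool ls); do
      # flag the pool if listing its objects hangs for more than 10s
      timeout 10 rados -p "$pool" ls > /dev/null || echo "$pool: timed out or failed"
  done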
To be honest, we have not checked all the logs yet! We concentrated
mainly on the mon logs, but we'll have a look at some OSD logs.
As for restarting daemons, I am not so reluctant to do it. My feeling is
that, in the absence of any message related to inconsistencies, there is
no real risk if we restart them one by one and check with 'ok-to-stop'
before each restart (see the sketch below). What's your feeling? Is it
worth restarting the 3 mons first (one by one)?
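What I have in mind is something like this (daemon names are examples,
and the systemctl units assume a non-cephadm deployment; with cephadm it
would rather be 'ceph orch daemon restart ...'):

  # monitors, one by one, waiting for the cluster to settle in between
  ceph mon ok-to-stop a && systemctl restart ceph-mon@a
  # then the OSDs, one by one
  ceph osd ok-to-stop 17 && systemctl restart ceph-osd@17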
You mention re-peering all PGs of one pool as an alternative. I was not
aware we could do that, but I see that there is a 'ceph pg repeer'
command. Is there anything else we should do before running it? Does it
make sense to try it on the PG stuck in the activating+remapped state?
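If I understand the command correctly, something like this would do it (a
sketch only; the pool name and pg id are examples, and on older releases
the JSON layout of 'ceph pg ls' may differ):

  # list the PGs currently stuck in activating
  ceph pg ls activating
  # repeer a single PG
  ceph pg repeer 2.7f
  # or repeer every PG of one pool
  for pg in $(ceph pg ls-by-pool mypool -f json | jq -r '.pg_stats[].pgid'); do
      ceph pg repeer "$pg"
  done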
Best regards,
Michel
On 27/03/2025 at 11:40, Frédéric Nass wrote:
echo "`ceph config get osd.0 mon_max_pg_per_osd`*`ceph config get osd.0 osd_max_pg_per_osd_hard_ratio`" | bc