Michel,

I can't recall any situations like that - maybe someone here does? - but I 
would advise that you restart all OSDs to trigger the re-peering of every PG. 
This should get your cluster back on track.
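
If you go that route, do it one OSD at a time. A rough sketch, assuming a 
cephadm-managed cluster (with package-based installs, use 'systemctl restart 
ceph-osd@<id>' instead of the orch command):

    for id in $(ceph osd ls); do
        # only proceed if stopping this OSD won't make any PG unavailable
        until ceph osd ok-to-stop $id; do sleep 10; done
        ceph orch daemon restart osd.$id
        sleep 30    # crude pause; watch 'ceph -s' until peering settles
    done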

Just make sure the crush map / crush rules / bucket weights (including OSD 
weights) haven't changed, as this would of course trigger rebalancing.
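
One way to eyeball that, assuming you still have an earlier decompiled dump 
to diff against ('crush-before.txt' below is such a hypothetical file); 
otherwise 'ceph osd tree' at least shows the current weights at a glance:

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt    # decompiled map: rules, buckets, weights
    diff crush-before.txt crush.txt
    ceph osd tree                          # bucket/OSD weights and reweights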

Regards,
Frédéric.

----- On 27 Mar 25, at 13:30, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

> Frédéric,
> 
> When I was writing the last email, my colleague launched a re-peering of
> the PG in activating state: the PG became active immediately, but it
> triggered a little rebalancing of other PGs, not necessarily in the same
> pool. After this success, we decided to go with your approach, selected a
> not-too-critical pool and ran a repeer command on all of its PGs. This
> resulted in a huge rebalancing (5M objects, in progress, affecting many
> pools), basically a rebalancing similar in size to the unexpected one we
> saw after the incident 2 days ago. Could it mean that the state of some
> OSD was improperly set/used after the restart of the OSD servers
> following the incident, resulting in an inappropriate placement of PGs
> that is now being fixed because the repeer command forces a re-evaluation
> of the CRUSH map?
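> 
> (For the record, the loop we ran was essentially something like the
> following rough sketch, with <pool> being the pool name.)
> 
>     # repeer every PG listed for the pool (lines start with a pgid like "12.7f")
>     for pg in $(ceph pg ls-by-pool <pool> | awk '/^[0-9]+\./ {print $1}'); do
>         ceph pg repeer "$pg"
>     done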
> 
> Cheers,
> 
> Michel
> 
> On 27/03/2025 at 12:16, Michel Jouvin wrote:
>> Frédéric,
>>
>> Thanks for your answer. I checked the number of PGs on osd.17: it is
>> 164, very far from the hard limit (750, the default I think). So it
>> doesn't seem to be the problem, and maybe the peering is a victim of
>> the more general problem that leaves many pools more or less
>> inaccessible. What "inaccessible" means here is not entirely clear:
>>
>> - We tested the ability to access the pool content with 'rados ls', as
>> I said, and we considered a pool inaccessible when the command timed
>> out after 10s (no explicit error). This also happens on empty pools.
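>>
>> (Roughly, the test was something like the sketch below, looping over
>> every pool with a 10s timeout on 'rados ls'.)
>>
>>     for pool in $(ceph osd pool ls); do
>>         echo -n "$pool: "
>>         # treat a 10-second timeout (or any error) as "inaccessible"
>>         timeout 10 rados -p "$pool" ls > /dev/null && echo OK || echo TIMEOUT
>>     done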
>>
>> - At the same time, on at least one such pool, we were able to
>> successfully upload and download a large file with an S3 client (this
>> pool is one of the data pools of a Swift RGW).
>>
>> To be honest, we have not checked all the logs yet! We concentrated
>> mainly on the mon logs, but we'll have a look at some OSD logs.
>>
>> As for restarting daemons, I am not that reluctant to do it. I have the
>> feeling that, in the absence of any message related to inconsistencies,
>> there is no real risk if we restart them one by one and check with
>> ok-to-stop before doing it. What's your feeling? Is it worth
>> restarting the 3 mons first (one by one)?
>>
>> You mention re-peering all PGs of one pool as an alternative. I was not
>> aware we could do that, but I see that there is a 'ceph pg repeer'
>> command. Anything else we should do before running the command? Does
>> it make sense to try it on the PG stuck in activating+remapped state?
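>>
>> (If so, I guess it would be something along the lines of:
>>
>>     ceph pg dump pgs_brief | grep activating    # spot the pgid of the stuck PG
>>     ceph pg repeer <pgid>
>>
>> with <pgid> being the PG currently in activating+remapped state.)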
>>
>> Best regards,
>>
>> Michel
>>
>> On 27/03/2025 at 11:40, Frédéric Nass wrote:
>>> echo "`ceph config get osd.0 mon_max_pg_per_osd`*`ceph config get osd.0 osd_max_pg_per_osd_hard_ratio`" | bc