Hi Michel,

----- On 27 Mar 25, at 18:52, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:
> Hello,
>
> It seems we are at the end of our stressful adventure!

Awesome!

> After the big rebalancing finished, without errors but without any
> significant impact on the pool access problem, we decided to reboot all
> our OSD servers one by one. The first good news is that it cleared all
> the reported issues (MDS complaining about a damaged rank, slow ops...)
> and we were able to recover access to all pools and filesystems. The
> second good news is that the OSD server reboot triggered no new
> rebalancing, meaning that the PG placement is stable again.
>
> Ceph is sometimes stressful, but it demonstrated again that it is a
> robust storage platform, as the data have not been in danger at any time
> (just the access)!

Indeed.

> With a great community to support us! Thanks!

I agree.

> That said, Frédéric and other experts, do you think it is worth doing a
> post-mortem analysis

Well, nobody died in this episode :-)

> to understand how we ended up in such a mess after an incident that
> looked somewhat trivial (a few OSDs crashing)?

Yes, investigating the sequence of events (at what time each OSD crashed, whether some of them crashed simultaneously) and the state of the cluster at each point in time would surely help to understand what happened. Most importantly, you would better understand why some OSDs were OOM-killed by the kernel, so you can avoid this in the future. Are you using swap on your OSD nodes?
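For the post-mortem, something along these lines could help reconstruct the timeline and check for memory pressure (just a sketch; the crash ID, the date, and the daemon IDs are placeholders to adapt to your environment):

  # Crash timeline as recorded by the cluster's crash module:
  ceph crash ls
  ceph crash info <crash-id>

  # On each OSD node, look for OOM kills around the incident:
  dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
  journalctl -k --since "2025-03-25" | grep -i oom

  # Swap status and the memory budget given to each OSD:
  swapon --show
  free -h
  ceph config get osd.0 osd_memory_target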
Regards,
Frédéric.

>
> Best regards,
>
> Michel
>
> On 27/03/2025 at 14:26, Frédéric Nass wrote:
>> Michel,
>>
>> I can't recall any situations like that - maybe someone here does? -
>> but I would advise that you restart all OSDs to trigger the re-peering
>> of every PG. This should get your cluster back on track.
>>
>> Just make sure the crush map / crush rules / bucket weights (including
>> OSD weights) haven't changed, as this would of course trigger
>> rebalancing.
>>
>> Regards,
>> Frédéric.
>>
>> ----- On 27 Mar 25, at 13:30, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:
>>
>>> Frédéric,
>>>
>>> When I was writing the last email, my colleague launched a re-peering
>>> of the PG in the activating state: the PG became active immediately
>>> but triggered a little bit of rebalancing of other PGs, not
>>> necessarily in the same pool. After this success, we decided to go for
>>> your approach, selected a not-too-critical pool and ran a repeer
>>> command on all of the pool's PGs. This resulted in a huge rebalancing
>>> (5M objects, in progress, affecting many pools), basically a
>>> rebalancing similar (in size) to the unexpected one we saw after the
>>> incident 2 days ago. Could it mean that the state of some OSDs was
>>> improperly set/used after the restart of the OSD servers following the
>>> incident, and may have resulted in an inappropriate placement of PGs
>>> that is currently being fixed now that the repeer command has
>>> triggered a reevaluation of the CRUSH map?
>>>
>>> Cheers,
>>>
>>> Michel
>>>
>>> On 27/03/2025 at 12:16, Michel Jouvin wrote:
>>>> Frédéric,
>>>>
>>>> Thanks for your answer. I checked the number of PGs on osd.17: it is
>>>> 164, very far from the hard limit (750, the default I think). So it
>>>> doesn't seem to be the problem, and maybe the peering is a victim of
>>>> the more general problem leaving many pools more or less
>>>> inaccessible. What inaccessible means here is not entirely clear:
>>>>
>>>> - We tested the ability to access the pool content with 'rados ls' as
>>>> I said, and we considered a pool inaccessible when the command was
>>>> timing out after 10s (no explicit error). This also happens on empty
>>>> pools.
>>>>
>>>> - At the same time, on at least one such pool, we were able to
>>>> successfully upload and download a large file with an S3 client (this
>>>> pool is part of the data pool of a Swift RGW).
>>>>
>>>> To be honest, we have not checked all the logs yet! We concentrated
>>>> mainly on the mon logs, but we'll have a look at some OSD logs.
>>>>
>>>> As for restarting daemons, I am not so reluctant to do it. I have the
>>>> feeling that, in the absence of any message related to
>>>> inconsistencies, there is no real risk if we restart them one by one
>>>> and check with ok-to-stop before doing so. What's your feeling? Is it
>>>> worth restarting the 3 mons first (one by one)?
>>>>
>>>> You mention re-peering all PGs of one pool as an alternative. I was
>>>> not aware we could do that, but I see that there is a 'ceph pg
>>>> repeer' command. Anything else we should do before running the
>>>> command? Does it make sense to try it on the PG stuck in the
>>>> activating+remapped state?
>>>>
>>>> Best regards,
>>>>
>>>> Michel
>>>>
>>>> On 27/03/2025 at 11:40, Frédéric Nass wrote:
>>>>> echo "`ceph config get osd.0 mon_max_pg_per_osd`*`ceph config get osd.0 osd_max_pg_per_osd_hard_ratio`" | bc
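For reference, assuming the usual defaults of mon_max_pg_per_osd=250 and osd_max_pg_per_osd_hard_ratio=3.0 (which would match the 750 figure quoted above), the same calculation plus a quick comparison against the actual per-OSD PG counts could look like this sketch:

  # Effective hard limit on PGs per OSD (equivalent to the backtick version quoted above):
  echo "$(ceph config get osd.0 mon_max_pg_per_osd) * $(ceph config get osd.0 osd_max_pg_per_osd_hard_ratio)" | bc

  # Compare with the current PG count on each OSD (PGS column):
  ceph osd df tree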