Hello,

It seems we are at the end of our stressful adventure! After the big rebalancing finished (without errors, but also without any significant impact on the pool access problem), we decided to reboot all our OSD servers one by one. The first good news is that this cleared all the reported issues (MDS complaining about a damaged rank, slow ops...) and we were able to recover access to all pools and filesystems. The second good news is that the OSD server reboots triggered no new rebalancing, meaning that the PG placement is stable again.

Ceph is sometimes stressful, but it has once again demonstrated that it is a robust storage platform: the data was never in danger at any time (only access to it)! And with a great community to support us! Thanks!

That said, Frédéric and other experts, do you think it is worth doing a post-mortem analysis to understand how we ended up in such a mess after an incident that looked somewhat trivial (a few OSDs crashing)?

Best regards,

Michel

On 27/03/2025 at 14:26, Frédéric Nass wrote:
Michel,

I can't recall any situation like that - maybe someone here does? - but I would advise restarting all OSDs to trigger a re-peering of every PG. This should get your cluster back on track.
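
For reference, a rolling restart could look something like this (just a sketch, assuming a package-based deployment with systemd units named ceph-osd@<id> and CRUSH host buckets named after the hostnames; adapt the unit names for cephadm/containerized setups, and run it host by host):

# Rolling restart of the OSDs on this host, one OSD at a time
for id in $(ceph osd ls-tree "$(hostname)"); do
    # only restart if stopping this OSD would not make any PG unavailable
    if ceph osd ok-to-stop "osd.$id"; then
        sudo systemctl restart "ceph-osd@$id"
        # wait for the OSD to be back up and in before moving to the next one
        until ceph osd dump | grep -Eq "^osd\.$id +up +in"; do sleep 5; done
    fi
done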

Just make sure the CRUSH map / CRUSH rules / bucket weights (including OSD weights) haven't changed, as this would of course trigger rebalancing.
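
One way to verify that is to snapshot the CRUSH-related state before the restarts and diff it afterwards, e.g. (file names are arbitrary):

ceph osd getcrushmap -o crushmap.before
crushtool -d crushmap.before -o crushmap.before.txt
ceph osd tree > osd-tree.before.txt
# ... restart the OSDs, then dump the same state again ...
ceph osd getcrushmap -o crushmap.after
crushtool -d crushmap.after -o crushmap.after.txt
ceph osd tree > osd-tree.after.txt
diff crushmap.before.txt crushmap.after.txt   # rules and bucket weights
diff osd-tree.before.txt osd-tree.after.txt   # OSD weights/reweights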

Regards,
Frédéric.

----- On 27 Mar 25, at 13:30, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

Frédéric,

When I was writing the last email, my colleague launched a re-peering of
the PG in activating state: the PG became active immediately but
triggered a little bit of rebalancing of other PGs, not necessarily in
the same pool. After this success, we decided to go with your approach,
selected a not-too-critical pool and ran a repeer command on all of its
PGs. This resulted in a huge rebalancing (5M objects, in progress,
affecting many pools), basically a rebalancing similar in size to the
unexpected one we saw after the incident two days ago. Could this mean
that the state of some OSDs was improperly set/used after the OSD servers
were restarted following the incident, resulting in an inappropriate
placement of PGs that is now being fixed because the repeer command
forces a reevaluation of the CRUSH mapping?
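
For completeness, repeering every PG of one pool can be done with something along these lines (not our exact commands, just the idea; the pool name is a placeholder):

pool=mypool   # placeholder: name of the pool to repeer
# extract the PG ids of that pool and ask each one to re-peer
for pg in $(ceph pg ls-by-pool "$pool" | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}'); do
    ceph pg repeer "$pg"
done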

Cheers,

Michel

On 27/03/2025 at 12:16, Michel Jouvin wrote:
Frédéric,

Thanks for your answer. I checked the number of PGs on osd.17: it is
164, very far from the hard limit (750, the default I think). So it
doesn't seem to be the problem, and maybe the peering is just a victim of
the more general problem that makes many pools more or less
inaccessible. What inaccessible means here is not entirely clear:

- We tested the ability to access the pool content with 'rados ls' as
I said, and we considered a pool inaccessible when the command timed out
after 10 s (no explicit error). This also happens on empty pools
(roughly how we ran this test over all pools is sketched after this list).

- At the same time, on one such pool at least, we were able to
successfully upload and download a large file with an S3 client (this
pool is one of the data pools of a Swift RGW).
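
Concretely, the test loop was along these lines (not the exact command, but the idea):

# try a 10-second 'rados ls' on every pool and report the ones that hang
for pool in $(ceph osd pool ls); do
    if timeout 10 rados -p "$pool" ls > /dev/null 2>&1; then
        echo "$pool: OK"
    else
        echo "$pool: timed out or failed"
    fi
done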

To be honest we have not checked all the logs yet! We concentrated
mainly on the mon logs, but we'll have a look at some OSD logs.

As for restarting daemons, I am not so reluctant to do it. I have the
feeling that, in the absence of any message related to inconsistencies,
there is no real risk if we restart them one by one and check with
ok-to-stop before doing it. What's your feeling? Is it worth
restarting the 3 mons first (one by one)?
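
For the mons, the check-then-restart sequence would be something like this (a sketch; assumes a recent Ceph with 'ceph mon ok-to-stop', package-based systemd units, and mon ID "a" as a placeholder):

# one mon at a time, e.g. for the mon with ID "a"
ceph mon ok-to-stop a && sudo systemctl restart ceph-mon@a
ceph mon stat   # wait until all mons are back in quorum before doing the next one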

You mention, as an alternative, re-peering all PGs of one pool. I was not
aware we could do that, but I see that there is a 'ceph pg repeer'
command. Anything else we should do before running the command? Does
it make sense to try it on the PG stuck in activating+remapped state?
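
In case it helps, one way to locate that PG and repeer just it (a sketch; the PG id is a placeholder):

# list PGs that are currently stuck inactive (e.g. activating+remapped)
ceph pg dump_stuck inactive
# re-peer a single PG, using a PG id taken from the output above
ceph pg repeer <pgid>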

Best regards,

Michel

On 27/03/2025 at 11:40, Frédéric Nass wrote:
echo "`ceph config get osd.0 mon_max_pg_per_osd`*`ceph config get
osd.0 osd_max_pg_per_osd_hard_ratio`" | bc