Hi,
I have not seen an answer yet; help would be very much appreciated, as
our production cluster seems to be in worse shape than initially described...

After a deeper analysis, we found that more than half of the pools,
despite being reported as OK, are not accessible: the 'rados ls' command
hangs when we try to access them. It is not correlated with the EC
versus 3-replica pool configuration (pools of both types are affected,
and pools of both types can be fine). We don't have a clear view of
whether we have a corruption problem (though it is unclear why that
would have happened) or a communication problem between some cluster
components, which could explain why pools reported as healthy are not
accessible and why one PG remains in the activating+remapped state,
something that seems unusual (I could not find any reference to it in
my Google searches).
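For anyone who wants to reproduce the check, we are probing each pool
roughly like this (POOL is just a placeholder); on the affected pools
the rados command never returns:

  # list a few objects, with a timeout so a hung pool doesn't block the shell
  timeout 30 rados -p POOL ls | head
  # check whether any PG of that pool is not active+clean
  ceph pg ls-by-pool POOL | grep -v 'active+clean'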
We are therefore hesitating between forcing a deep scrub of everything
(we have not seen any errors reported by the scrubs/deep scrubs run in
the last few days) and restarting the whole cluster, in case there is
some kind of deadlock in the communication between some mons and/or
OSDs. If the second approach (cluster restart) is suggested, should we
do it daemon by daemon or shut everything down and do a cold restart of
the cluster?
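If it helps the discussion, for the first option we were thinking of
something along these lines:

  # ask every OSD to deep-scrub its PGs (this will take a while with ~500 OSDs)
  for osd in $(ceph osd ls); do ceph osd deep-scrub "$osd"; done

and for the second option, since the cluster is cephadm-managed, a
rolling restart daemon by daemon, for instance:

  ceph orch daemon restart osd.<id>
  ceph orch daemon restart mon.<host>

(osd.<id> and mon.<host> being placeholders), rather than a full cold
restart, unless a cold restart is really the recommended way.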
I don't want to clutter this already long thread with too many details,
but one of my colleagues gave me the 'ceph -s' and 'ceph osd status'
output from before he started to reboot servers. If it is useful, I can
share it.
Again, thanks in advance for any help/hint.
Best regards,
Michel
On 26/03/2025 at 21:54, Michel Jouvin wrote:
And sorry for all these mails: I forgot to mention that we are running
18.2.2.
Michel
On 26/03/2025 at 21:51, Michel Jouvin wrote:
Hi again,
Looking for more info on the degraded filesystem, I managed to connect
to the dashboard, where I see an error that is not reported as
explicitly by 'ceph health':

One or more metadata daemons (MDS ranks) are failed or in a damaged
state. At best the filesystem is partially available, at worst the
filesystem is completely unusable.

But I can't figure out what can be done from this point... and I really
don't understand how we ended up in such a state...
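If someone can suggest where to dig, the only commands I know of in
this area are along these lines (<fsname> being the degraded
filesystem); is there anything else worth looking at?

  ceph fs status
  ceph mds stat
  # per-rank metadata damage report, assuming rank 0 of the degraded FS
  ceph tell mds.<fsname>:0 damage ls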
Cheers,
Michel
On 26/03/2025 at 21:27, Michel Jouvin wrote:
Hi,
We have a production cluster made of 3 mon+mgr nodes and 18 OSD servers
with ~500 OSDs, configured with ~50 pools, half EC (9+6) and half
replica 3. It also has 2 CephFS filesystems with 1 MDS each.
Two days ago, over a period spanning 16 hours, 13 OSDs crashed with an
OOM. The OSDs were first restarted, but it was then decided to reboot
the server hosting a crashed OSD, and "by mistake" (it was at least
useless) the OSDs of the rebooted server were set noout,norebalance
before the reboot. The flags were removed after the reboot.
After all of this, 'ceph -s' started to report a lot of misplaced PGs
and recovery started. All the PGs but one were successfully
reactivated; one stayed in the activating+remapped state (it is located
in a pool used for tests). 'ceph health' (I am not putting the details
here to avoid a too long mail, but I can share them) says:

HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 2
MDSs report slow metadata IOs; Reduced data availability: 1 pg
inactive; 13 daemons have recently crashed

It also reports one of the filesystems as degraded, even though the
only PG reported inactive is not part of a pool related to that FS.

The recovery was slow until we realized we should change the mclock
profile to high_recovery_ops; it then completed in a few hours.
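For the record, the profile change was done with something like the
following (I believe this is the cluster-wide OSD setting; please
correct me if a different knob is recommended):

  ceph config set osd osd_mclock_profile high_recovery_ops
  # check the value actually applied
  ceph config get osd osd_mclock_profile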
Unfortunately the degraded filesystem remains degraded without an
obvious reason... and the inactive PG is still in the
activating+remapped state. We have not been able to identify a relevant
error in the logs so far (but we may have missed something...).
So far we have avoided restarting too many things until we have a
better understanding of what happened and of the current state. We only
restarted the mgr, which was using a lot of CPU, and the MDS of the
degraded FS, without any improvement.
We are looking for advice about where to start... It seems we have (at
least) 2 independent problems:

- A PG that cannot be reactivated, with a remap operation that does not
proceed: would stopping osd.17 help (so that osd.460 is reused)? (See
the sketch after this list.)
[root@ijc-mon1 ~]# ceph pg dump_stuck
PG_STAT  STATE                UP            UP_PRIMARY  ACTING         ACTING_PRIMARY
32.7ef   activating+remapped  [100,154,17]  100         [100,154,460]  100
- 1 degraded filesystem: where to look for a reason?
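For problem 1, before stopping osd.17, would it make sense to first try
something along these lines (32.7ef and osd.100 taken from the output
above)?

  # detailed peering state of the stuck PG
  ceph pg 32.7ef query > pg_32.7ef_query.json
  # ask the PG to go through peering again (less intrusive than stopping an OSD)
  ceph pg repeer 32.7ef
  # if that is not enough, mark the current primary down so the PG re-peers
  ceph osd down 100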
Thanks in advance for any help!
Cheers,
Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io