We're trying to determine the root cause of a CephFS outage. We have three
MDS ranks in an active-standby configuration.
During the outage, several MDSs crashed. The timeline of the crashes was:
2025-04-13T14:19:45 mds.r-cephfs-hdd-f on node06.internal
2025-04-13T14:38:35 mds.r-cephfs-hdd-a on node02.internal
2
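For each of these crashes, the full signature and backtrace can normally be
pulled from the crash module, roughly like this (the crash ID is a
placeholder):

  # List new and archived crash reports with their timestamps
  ceph crash ls
  # Show the backtrace and metadata for one report
  ceph crash info <crash_id>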
Hi Dan, hi all,
This is related to the thread "Help needed, ceph fs down due to large stray
dir". We deployed a bare-metal host here for debugging ceph daemon issues, so
that we can run "perf top" and find out where our MDS becomes unresponsive.
Unfortunately, we have encountered a strange issue:
The bare-metal
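(For reference, the kind of profiling meant above is roughly the following;
it assumes a single ceph-mds process on the node:)

  # Live profile of the running MDS to see where it spends CPU time
  perf top -p "$(pidof ceph-mds)"
  # Or record a 30-second profile with call graphs for later inspection
  perf record -g -p "$(pidof ceph-mds)" -- sleep 30
  perf report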
Hi all,
at the University of Zurich we run a cephfs cluster of ~12PB raw size.
We currently run Pacific 16.2.15 and our clients (Ubuntu 20.04) mount cephfs
using the kernel driver.
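(For reference, such a kernel-driver mount looks roughly like this; the
monitor address, client name and secret file are placeholders:)

  # Kernel CephFS mount; monitor address and credentials are placeholders
  mount -t ceph 192.0.2.10:6789:/ /mnt/cephfs \
      -o name=cephfs-client,secretfile=/etc/ceph/cephfs-client.secret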
The cluster was deployed on Mimic and subsequently upgraded to Nautilus
(14.2.22) and then to Pacific (16.2.15).
Last Wednesda
Hi,
I have a small cluster with 11 OSDs and 4 filesystems. Each server
(Debian 11, Ceph 17.2.7) usually runs several services.
After troubles with a host with OSDs, I removed the OSDs and let the
cluster repair itself (x3 replica). After a while it returned to a
healthy state and everything
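(The usual sequence for this, sketched with a placeholder OSD ID and not
necessarily the exact commands used here, is:)

  # Mark the OSD out so its PGs are re-replicated to the remaining OSDs
  ceph osd out 7
  # Watch recovery progress until the cluster reports HEALTH_OK
  ceph -s
  # Once recovery is done, remove the OSD from the cluster for good
  ceph osd purge 7 --yes-i-really-mean-it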
Hi,
I have an 18.2.0 Ceph cluster and my MDSs are now crashing repeatedly.
After a few automatic restarts, every MDS is removed and only one stays
active. But it's flagged "laggy" and I can't even start a scrub on it.
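(For reference, a forward scrub on the active rank would normally be started
roughly like this, with "cephfs" as a placeholder filesystem name:)

  # Start a recursive forward scrub on rank 0 of the filesystem
  ceph tell mds.cephfs:0 scrub start / recursive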
In the log I have this during crashes:
Dec 13 15:54:02 ceph04
ceph-ff6e50de-