[ceph-users] MDS crashing with "Assertion `px != 0' failed"

2025-04-15 Thread Simon Campion
We're trying to determine the root cause of a CephFS outage. We have three MDS ranks with active-standby. During the outage, several MDSs crashed. The timeline of the crashes was: 2025-04-13T14:19:45 mds.r-cephfs-hdd-f on node06.internal 2025-04-13T14:38:35 mds.r-cephfs-hdd-a on node02.internal 2
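
For reference, a minimal sketch of how crash backtraces like the one in the subject can be pulled from the cluster's crash module (the crash ID below is a placeholder):

    # list recent daemon crashes recorded by the mgr crash module
    ceph crash ls
    # show metadata and the backtrace for a specific crash (placeholder ID)
    ceph crash info <crash-id>
    # check which MDS daemons hold each rank and which are standby
    ceph fs status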

[ceph-users] MDS crashing on startup

2025-01-14 Thread Frank Schilder
Hi Dan, hi all, this is related to the thread "Help needed, ceph fs down due to large stray dir". We deployed a bare-metal host for debugging ceph daemon issues, in this case to run "perf top" to find out where our MDS becomes unresponsive. Unfortunately, we encounter a strange issue: The bare-metal
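
A rough sketch of the profiling step described above, assuming the MDS runs as a ceph-mds process on that host and perf is installed:

    # find the PID of the (possibly unresponsive) MDS daemon
    pgrep -a ceph-mds
    # sample where the newest ceph-mds process spends CPU time; -g adds call graphs
    perf top -g -p $(pgrep -n ceph-mds)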

[ceph-users] MDS crashing and stuck in replay(laggy) ( "batch_ops.empty()", "p->first <= start" )

2024-12-10 Thread Enrico Favero
Hi all, at the University of Zurich we run a cephfs cluster of ~12PB raw size. We currently run Pacific 16.2.15 and our clients (Ubuntu 20.04) mount cephfs using the kernel driver. The cluster was deployed on Mimic and subsequently upgraded to Nautilus (14.2.22) and then Pacific (16.2.15). Last Wednesda
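
For context, a kernel-driver mount of the kind mentioned looks roughly like this (monitor address, client name, and secret file are placeholders):

    # classic kernel-client mount syntax, as used on older kernels such as Ubuntu 20.04's
    mount -t ceph 192.168.0.10:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret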

[ceph-users] MDS crashing

2024-05-29 Thread Johan
Hi, I have a small cluster with 11 OSDs and 4 filesystems. Each server (Debian 11, Ceph 17.2.7) usually runs several services. After trouble with one host's OSDs, I removed the OSDs and let the cluster repair itself (3x replica). After a while it returned to a healthy state and everything
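
A rough outline of the OSD removal and self-healing flow described there (OSD id 5 is a placeholder):

    # take the OSD out so its PGs are remapped and backfilled onto other hosts
    ceph osd out 5
    # once backfill is done, remove the daemon and its CRUSH/auth entries
    ceph osd purge 5 --yes-i-really-mean-it
    # watch recovery progress and the return to HEALTH_OK
    ceph -s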

[ceph-users] MDS crashing repeatedly

2023-12-13 Thread Thomas Widhalm
Hi, I have an 18.2.0 Ceph cluster and my MDSs are now crashing repeatedly. After a few automatic restarts, every MDS is removed and only one stays active. But it's flagged "laggy" and I can't even start a scrub on it. In the log I have this during crashes: Dec 13 15:54:02 ceph04 ceph-ff6e50de-
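
For reference, checking MDS states and starting the scrub mentioned above would typically look like this (the filesystem name "cephfs" is a placeholder; a rank flagged laggy usually refuses the request):

    # see which daemons are active, standby, or flagged laggy
    ceph fs status
    # ask rank 0 of the filesystem to scrub the tree recursively
    ceph tell mds.cephfs:0 scrub start / recursive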