We're trying to determine the root cause of a CephFS outage. The file system runs with three active MDS ranks plus standby daemons.
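For completeness: the crash reports below come from the cluster's crash module, i.e. roughly the following commands (&lt;crash_id&gt; is just a placeholder for an ID as printed by `ceph crash ls`):

bash-5.1$ ceph crash ls                  # one line per crash report: ID (timestamp_uuid) and entity name
bash-5.1$ ceph crash info <crash_id>     # full JSON report for a single crash, as pasted further down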
During the outage, several MDSs crashed. The timeline of the crashes was:

2025-04-13T14:19:45  mds.r-cephfs-hdd-f on node06.internal
2025-04-13T14:38:35  mds.r-cephfs-hdd-a on node02.internal
2025-04-13T14:38:37  mds.r-cephfs-hdd-b on node07.internal
2025-04-13T14:38:38  mds.r-cephfs-hdd-d on node05.internal
2025-04-13T14:48:52  mds.r-cephfs-hdd-e on node08.internal
2025-04-13T14:54:12  mds.r-cephfs-hdd-f on node06.internal

At around 15:00, the file system recovered by itself.

For all six crashes, the MDS logged this assertion failure as the reason for the crash:

ceph-mds: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.1/rpm/el9/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:201: T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]: Assertion `px != 0' failed.

For the earliest crash, here's the `ceph crash info` output:

bash-5.1$ ceph crash info 2025-04-13T14:19:45.645607Z_7e6475e0-9a22-4e3f-a282-0ab02a7c972c
{
    "backtrace": [
        "/lib64/libc.so.6(+0x3e930) [0x7f4990b08930]",
        "/lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]",
        "raise()",
        "abort()",
        "/lib64/libc.so.6(+0x2875b) [0x7f4990af275b]",
        "/lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]",
        "ceph-mds(+0x1c2829) [0x559da9485829]",
        "ceph-mds(+0x3191a7) [0x559da95dc1a7]",
        "(MDSContext::complete(int)+0x5c) [0x559da971117c]",
        "(Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]",
        "/lib64/libc.so.6(+0x8a292) [0x7f4990b54292]",
        "/lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]"
    ],
    "ceph_version": "19.2.1",
    "crash_id": "2025-04-13T14:19:45.645607Z_7e6475e0-9a22-4e3f-a282-0ab02a7c972c",
    "entity_name": "mds.r-cephfs-hdd-f",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "9",
    "os_version_id": "9",
    "process_name": "ceph-mds",
    "stack_sig": "8975f8e99bd02b53c8d37ce7cc9e85dc5d4898104a0949d0829819a753123f18",
    "timestamp": "2025-04-13T14:19:45.645607Z",
    "utsname_hostname": "node06.internal",
    "utsname_machine": "x86_64",
    "utsname_release": "6.6.83-flatcar",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Mar 17 16:07:40 -00 2025"
}

And here's an excerpt of the logs, starting with the aforementioned assertion failure:

ceph-mds: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.1/rpm/el9/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:201: T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]: Assertion `px != 0' failed.
*** Caught signal (Aborted) **
 in thread 7f4984f30640 thread_name:

 ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid (stable)
 1: /lib64/libc.so.6(+0x3e930) [0x7f4990b08930]
 2: /lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x2875b) [0x7f4990af275b]
 6: /lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]
 7: ceph-mds(+0x1c2829) [0x559da9485829]
 8: ceph-mds(+0x3191a7) [0x559da95dc1a7]
 9: (MDSContext::complete(int)+0x5c) [0x559da971117c]
 10: (Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]
 11: /lib64/libc.so.6(+0x8a292) [0x7f4990b54292]
 12: /lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]
debug 2025-04-13T14:19:45.645+0000 7f4984f30640 -1 *** Caught signal (Aborted) **
 in thread 7f4984f30640 thread_name:

 ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid (stable)
 1: /lib64/libc.so.6(+0x3e930) [0x7f4990b08930]
 2: /lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x2875b) [0x7f4990af275b]
 6: /lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]
 7: ceph-mds(+0x1c2829) [0x559da9485829]
 8: ceph-mds(+0x3191a7) [0x559da95dc1a7]
 9: (MDSContext::complete(int)+0x5c) [0x559da971117c]
 10: (Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]
 11: /lib64/libc.so.6(+0x8a292) [0x7f4990b54292]
 12: /lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
debug  -9999> 2025-04-13T14:17:22.696+0000 7f4986f34640  5 mds.2.log trim already expired LogSegment(79533796/0x1a5ea43a405 events=117)
debug  -9998> 2025-04-13T14:17:22.696+0000 7f4986f34640  5 mds.2.log trim already expired LogSegment(79533913/0x1a5ea8350c6 events=75)
debug  -9997> 2025-04-13T14:17:22.696+0000 7f4986f34640  5 mds.2.log trim already expired LogSegment(79533988/0x1a5eac5c3df events=95)
...

Do you have any ideas what could cause these crashes, or how we could troubleshoot further? We're happy to provide more information if that'd help.

Simon
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io