We're trying to determine the root cause of a CephFS outage. We have three
MDS ranks in an active/standby configuration.
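For reference, the current layout can be seen with the commands below
(we're happy to paste the actual output if useful):

ceph fs status    # active ranks and standbys per file system
ceph mds stat     # one-line summary of up/active/standby MDS daemons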

During the outage, several MDSs crashed. The timeline of the crashes was:

2025-04-13T14:19:45 mds.r-cephfs-hdd-f on node06.internal
2025-04-13T14:38:35 mds.r-cephfs-hdd-a on node02.internal
2025-04-13T14:38:37 mds.r-cephfs-hdd-b on node07.internal
2025-04-13T14:38:38 mds.r-cephfs-hdd-d on node05.internal
2025-04-13T14:48:52 mds.r-cephfs-hdd-e on node08.internal
2025-04-13T14:54:12 mds.r-cephfs-hdd-f on node06.internal
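
For reference, these timestamps line up with what the crash module
reports; the same listing can be pulled with:

ceph crash ls        # all recorded crashes with timestamp and entity name
ceph crash ls-new    # only crashes that have not been archived yet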

Around 15:00, the file system recovered on its own.

For all six crashes, the MDS logged this as the reason for the crash:

ceph-mds:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.1/rpm/el9/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:201:
T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]:
Assertion `px != 0' failed.

For the earliest crash, here's the `ceph crash info` output:

bash-5.1$ ceph crash info
2025-04-13T14:19:45.645607Z_7e6475e0-9a22-4e3f-a282-0ab02a7c972c
{
    "backtrace": [
        "/lib64/libc.so.6(+0x3e930) [0x7f4990b08930]",
        "/lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]",
        "raise()",
        "abort()",
        "/lib64/libc.so.6(+0x2875b) [0x7f4990af275b]",
        "/lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]",
        "ceph-mds(+0x1c2829) [0x559da9485829]",
        "ceph-mds(+0x3191a7) [0x559da95dc1a7]",
        "(MDSContext::complete(int)+0x5c) [0x559da971117c]",
        "(Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]",
        "/lib64/libc.so.6(+0x8a292) [0x7f4990b54292]",
        "/lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]"
    ],
    "ceph_version": "19.2.1",
    "crash_id":
"2025-04-13T14:19:45.645607Z_7e6475e0-9a22-4e3f-a282-0ab02a7c972c",
    "entity_name": "mds.r-cephfs-hdd-f",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "9",
    "os_version_id": "9",
    "process_name": "ceph-mds",
    "stack_sig":
"8975f8e99bd02b53c8d37ce7cc9e85dc5d4898104a0949d0829819a753123f18",
    "timestamp": "2025-04-13T14:19:45.645607Z",
    "utsname_hostname": "node06.internal",
    "utsname_machine": "x86_64",
    "utsname_release": "6.6.83-flatcar",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Mar 17 16:07:40 -00 2025"
}
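
If it would help, we can try to resolve the two raw ceph-mds offsets in
that backtrace (+0x1c2829 and +0x3191a7) to source lines, roughly like
this (assuming we can install the matching ceph-debuginfo package inside
the MDS container):

dnf install -y ceph-debuginfo-19.2.1   # needs the debuginfo repo enabled
addr2line -Cfie /usr/bin/ceph-mds 0x1c2829 0x3191a7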

And here's an excerpt of the logs, starting with the aforementioned
assertion failure:

ceph-mds:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.1/rpm/el9/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:201:
T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]:
Assertion `px != 0' failed.
*** Caught signal (Aborted) **
 in thread 7f4984f30640 thread_name:
 ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid
(stable)
 1: /lib64/libc.so.6(+0x3e930) [0x7f4990b08930]
 2: /lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x2875b) [0x7f4990af275b]
 6: /lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]
 7: ceph-mds(+0x1c2829) [0x559da9485829]
 8: ceph-mds(+0x3191a7) [0x559da95dc1a7]
 9: (MDSContext::complete(int)+0x5c) [0x559da971117c]
 10: (Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]
 11: /lib64/libc.so.6(+0x8a292) [0x7f4990b54292]
 12: /lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]
debug 2025-04-13T14:19:45.645+0000 7f4984f30640 -1 *** Caught signal
(Aborted) **
 in thread 7f4984f30640 thread_name:
 ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid
(stable)
 1: /lib64/libc.so.6(+0x3e930) [0x7f4990b08930]
 2: /lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x2875b) [0x7f4990af275b]
 6: /lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]
 7: ceph-mds(+0x1c2829) [0x559da9485829]
 8: ceph-mds(+0x3191a7) [0x559da95dc1a7]
 9: (MDSContext::complete(int)+0x5c) [0x559da971117c]
 10: (Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]
 11: /lib64/libc.so.6(+0x8a292) [0x7f4990b54292]
 12: /lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
--- begin dump of recent events ---
debug  -9999> 2025-04-13T14:17:22.696+0000 7f4986f34640  5 mds.2.log trim
already expired LogSegment(79533796/0x1a5ea43a405 events=117)
debug  -9998> 2025-04-13T14:17:22.696+0000 7f4986f34640  5 mds.2.log trim
already expired LogSegment(79533913/0x1a5ea8350c6 events=75)
debug  -9997> 2025-04-13T14:17:22.696+0000 7f4986f34640  5 mds.2.log trim
already expired LogSegment(79533988/0x1a5eac5c3df events=95)
...

Do you have any ideas about what could cause these crashes, or how we
could troubleshoot further? We're happy to provide more information if
that would help.
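
If more verbose MDS logging around a future occurrence would be useful,
we could raise the debug levels in the meantime, e.g.:

# raise MDS log verbosity cluster-wide (generates a lot of log volume)
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1
# ...and revert to the defaults afterwards
ceph config rm mds debug_mds
ceph config rm mds debug_ms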

Simon