[ceph-users] MDS crashing on startup

Frank Schilder Tue, 14 Jan 2025 06:32:05 -0800

Hi Dan, hi all,

This is related to the thread "Help needed, ceph fs down due to large stray 
dir". We deployed a bare-metal host for debugging Ceph daemon issues, in this 
case to run "perf top" and find out where our MDS becomes unresponsive. 
Unfortunately, we have run into a strange issue:


The bare-metal MDS crashes very quickly during the initial reconnect phase:

   -61> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
   -60> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425250501:594886 lookup #0x3001059de1e/02JanParetoResults_n5_a7_m3 2025-01-13T15:25:32.427929-0500 RETRY=6 caller_uid=315104, caller_gid=315104{}) v2
   -59> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
   -58> 2025-01-14T08:59:47.202-0500 7f6771d20700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: In function 'void Processor::accept()' thread 7f6771d20700 time 2025-01-14T08:59:47.200795-0500
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.15/rpm/el8/BUILD/ceph-16.2.15/src/msg/async/AsyncMessenger.cc: 214: ceph_abort_msg("abort() called")

 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f6776f6e904]
 2: (Processor::accept()+0x862) [0x7f6777261502]
 3: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
 4: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
 5: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
 6: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
 7: clone()

   -57> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425250501 v1:192.168.57.60:0/1003106369 after 1.01409
...
    -3> 2025-01-14T08:59:47.202-0500 7f676e519700  5 mds.2.server dispatch request in up:reconnect: client_request(client.425612912:251430 lookup #0x10000f5568d/util-linux 2025-01-11T17:43:34.212128-0500 RETRY=8 caller_uid=298337, caller_gid=298337{}) v2
    -2> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting
    -1> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425612912 v1:192.168.58.11:0/4294630612 after 1.01409
     0> 2025-01-14T08:59:47.203-0500 7f6771d20700 -1 *** Caught signal (Aborted) **
 in thread 7f6771d20700 thread_name:msgr-worker-0

 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12d10) [0x7f6775f55d10]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f6776f6e9d5]
 5: (Processor::accept()+0x862) [0x7f6777261502]
 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x7f67772b6b87]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]
 8: /lib64/libstdc++.so.6(+0xc2b23) [0x7f6775380b23]
 9: /lib64/libpthread.so.0(+0x81ca) [0x7f6775f4b1ca]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
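
To see at a glance which thread aborts and which one is only handling the 
reconnects, the dump entries can be grouped by thread ID. The following is 
only a sketch in Python: the regular expression is an assumption about the 
layout of the dump entries above (index, timestamp, thread ID, log level, 
message), the helper name group_by_thread is hypothetical, and the sample 
lines are copied from the excerpt with the mail line wrapping undone.

```python
import re
from collections import defaultdict

# Assumed entry layout: "<index>> <timestamp> <thread-id> <level> <message>"
ENTRY = re.compile(r"^\s*(-?\d+)> (\S+) ([0-9a-f]+)\s+(-?\d+) (.*)$")


def group_by_thread(lines):
    """Return {thread_id: [(index, level, message), ...]} for matching lines."""
    groups = defaultdict(list)
    for line in lines:
        m = ENTRY.match(line)
        if m:
            index, _timestamp, thread, level, message = m.groups()
            groups[thread].append((int(index), int(level), message))
    return dict(groups)


sample = [
    "   -59> 2025-01-14T08:59:47.202-0500 7f676e519700  3 mds.2.server not active yet, waiting",
    "    -1> 2025-01-14T08:59:47.202-0500 7f676e519700  0 log_channel(cluster) log [DBG] : reconnect by client.425612912 v1:192.168.58.11:0/4294630612 after 1.01409",
    "     0> 2025-01-14T08:59:47.203-0500 7f6771d20700 -1 *** Caught signal (Aborted) **",
]

groups = group_by_thread(sample)

# Threads with at least one entry at a negative log level (errors).
error_threads = sorted(
    thread
    for thread, entries in groups.items()
    if any(level < 0 for _index, level, _message in entries)
)
print(error_threads)
```

In the excerpt the only thread logging at level -1 is 7f6771d20700, which the 
signal handler names as msgr-worker-0, while 7f676e519700 carries the 
mds.2.server reconnect messages.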

All other log messages look normal. The strange thing is that the bare-metal 
and the containerized MDSes run the same version, yet the containerized daemon 
does *not* crash. Versions are:

bare-metal# ceph-mds --version
ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)

container# ceph-mds --version
ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)

Both binaries have the same md5 sum as well. The only possibly relevant 
difference might be the kernel version:

bare-metal: 4.18.0-553.34.1.el8_10.x86_64
containerized: 5.14.13-1.el7.elrepo.x86_64

I also installed all sorts of debuginfo packages. Still, this symbol is not 
resolved:

 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]

Which package provides the debug symbols for it? I did install 
ceph-common-debuginfo-16.2.15-0.el8.x86_64.rpm.
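
For reference, an unresolved frame of this form still carries the library 
path, the offset inside the library and the absolute address, so the offset 
can be handed to a symbolizer by hand. The sketch below (Python, helper name 
parse_frame is hypothetical) parses the frame, derives the load base of the 
library and builds an addr2line command line. It assumes binutils addr2line is 
installed and is able to locate the separate debug info for the library, which 
is not confirmed for this host.

```python
import re

# Frame layout as printed in the backtrace: "<n>: <path>(+0x<offset>) [0x<address>]"
FRAME = re.compile(r"^\s*\d+: (\S+)\(\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]$")


def parse_frame(line):
    """Split an unresolved frame into (library path, offset, absolute address)."""
    m = FRAME.match(line)
    if m is None:
        raise ValueError(f"not an unresolved library frame: {line!r}")
    path, offset, address = m.groups()
    return path, int(offset, 16), int(address, 16)


frame = " 7: /usr/lib64/ceph/libceph-common.so.2(+0x5c90bc) [0x7f67772bd0bc]"
path, offset, address = parse_frame(frame)

# Address at which the library was mapped in the crashed process.
load_base = address - offset

# -C demangles C++ names, -f prints the function name, -e selects the object file.
cmd = ["addr2line", "-C", "-f", "-e", path, hex(offset)]

print(hex(load_base))
print(" ".join(cmd))
```

A page-aligned load base is a quick sanity check that offset and address 
belong together. On an RPM-based host, `rpm -qf` on the library path should 
name the package that owns the file, and the matching -debuginfo package would 
be the one to check.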

To install the MDS we followed the manual deployment instructions here: 
https://docs.ceph.com/en/pacific/install/manual-deployment/#adding-mds .

Thanks for any pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
