Hi Emmanuel,

This looks like a known issue, https://tracker.ceph.com/issues/58392, and there is a fix in https://github.com/ceph/ceph/pull/49652.
Could you stop all the clients first, then set 'max_mds' to 1, and then 
restart the MDS daemons?
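For example, a minimal sketch of those steps ('myfs' is a placeholder for 
your filesystem name, and 'icadmin006' is the daemon id taken from your log):
```
# 1. Stop (unmount) all CephFS clients first.

# 2. Reduce the active MDS count to a single rank
#    ('myfs' is a placeholder; see 'ceph fs ls' for your filesystem name):
ceph fs set myfs max_mds 1

# 3. Restart the MDS daemons on each MDS host, e.g. with systemd
#    ('icadmin006' is the daemon id from your log):
systemctl restart ceph-mds@icadmin006

# 4. Watch the ranks come back:
ceph fs status
```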
Thanks

On 5/3/23 16:01, Emmanuel Jaep wrote:
Hi,

I just inherited a Ceph storage cluster, so my level of confidence with the 
tool is certainly less than ideal.

We currently have an MDS daemon that refuses to come back online. While 
reviewing the logs, I can see that, upon MDS start, the recovery goes well:
```
    -10> 2023-05-03T08:12:43.632+0200 7f345d00b700  1 mds.4.2638711 cluster recovered.
```

However, right after this message, Ceph handles a couple of client requests:
```
     -9> 2023-05-03T08:12:43.632+0200 7f345d00b700  4 mds.4.2638711 
set_osd_epoch_barrier: epoch=249241
     -8> 2023-05-03T08:12:43.632+0200 7f3459003700  2 mds.4.cache Memory usage: 
 total 2739784, rss 2321188, heap 348412, baseline 315644, 0 / 765023 inodes have 
caps, 0 caps, 0 caps per inode
     -7> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server 
handle_client_request client_request(client.108396030:57271 lookup 
#0x70001516236/012385530.npy 2023-05-02T20:37:19.675666+0200 RETRY=6 
caller_uid=135551, caller_gid=11157{0,4,27,11157,}) v5
     -6> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server 
handle_client_request client_request(client.104073212:5109945 readdir 
#0x70001516236 2023-05-02T20:36:29.517066+0200 RETRY=6 caller_uid=180090, 
caller_gid=11157{0,4,27,11157,}) v5
     -5> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server 
handle_client_request client_request(client.104288735:3008344 readdir 
#0x70001516236 2023-05-02T20:36:29.520801+0200 RETRY=6 caller_uid=135551, 
caller_gid=11157{0,4,27,11157,}) v5
     -4> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server 
handle_client_request client_request(client.8558540:46306346 readdir 
#0x700019ba15e 2023-05-01T21:26:34.303697+0200 RETRY=49 caller_uid=0, 
caller_gid=0{}) v2
     -3> 2023-05-03T08:12:43.688+0200 7f3458802700  4 mds.4.server 
handle_client_request client_request(client.96913903:2156912 create 
#0x1000b37db9a/street-photo-3.png 2023-05-01T17:27:37.454042+0200 RETRY=59 
caller_uid=271932, caller_gid=30034{}) v2
     -2> 2023-05-03T08:12:43.688+0200 7f345d00b700  5 mds.icadmin006 
handle_mds_map old map epoch 2638715 <= 2638715, discarding
```

and crashes:
```
     -1> 2023-05-03T08:12:43.692+0200 7f345d00b700 -1 
/build/ceph-16.2.10/src/mds/Server.cc: In function 'void 
Server::handle_client_open(MDRequestRef&)' thread 7f345d00b700 time 
2023-05-03T08:12:43.694660+0200
/build/ceph-16.2.10/src/mds/Server.cc: 4240: FAILED ceph_assert(cur->is_auth())

  ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific 
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x152) [0x7f3462533d65]
  2: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
  3: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834) 
[0x558323c89f04]
  4: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f) 
[0x558323c925ef]
  5: 
(Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45) 
[0x558323cc3575]
  6: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d) 
[0x558323d7460d]
  7: (MDSContext::complete(int)+0x61) [0x558323f68681]
  8: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t, MDSContext*, 
bool, int)+0x3e) [0x558323d3edce]
  9: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
  10: (MDSContext::complete(int)+0x61) [0x558323f68681]
  11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, 
int)+0xcf) [0x558323d5ff2f]
  12: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, 
int)+0xbf) [0x558323d602df]
  13: (MDSContext::complete(int)+0x61) [0x558323f68681]
  14: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
  15: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, 
bool)+0x1fa) [0x558323c24a1a]
  16: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> 
const&)+0x5e) [0x558323c254fe]
  17: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) 
[0x558323bfd906]
  18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> 
const&)+0x460) [0x7f34627854e0]
  19: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
  20: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
  21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
  22: clone()

      0> 2023-05-03T08:12:43.700+0200 7f345d00b700 -1 *** Caught signal 
(Aborted) **
  in thread 7f345d00b700 thread_name:ms_dispatch

  ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific 
(stable)
  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0) [0x7f34622843c0]
  2: gsignal()
  3: abort()
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1ad) [0x7f3462533dc0]
  5: /usr/lib/ceph/libceph-common.so.2(+0x265f6d) [0x7f3462533f6d]
  6: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x1834) 
[0x558323c89f04]
  7: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x28f) 
[0x558323c925ef]
  8: 
(Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xa45) 
[0x558323cc3575]
  9: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x3d) 
[0x558323d7460d]
  10: (MDSContext::complete(int)+0x61) [0x558323f68681]
  11: (MDCache::_open_remote_dentry_finish(CDentry*, inodeno_t, MDSContext*, 
bool, int)+0x3e) [0x558323d3edce]
  12: (C_MDC_OpenRemoteDentry::finish(int)+0x3e) [0x558323de6cce]
  13: (MDSContext::complete(int)+0x61) [0x558323f68681]
  14: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, 
int)+0xcf) [0x558323d5ff2f]
  15: (MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, 
int)+0xbf) [0x558323d602df]
  16: (MDSContext::complete(int)+0x61) [0x558323f68681]
  17: (MDSRank::_advance_queues()+0x88) [0x558323c23c38]
  18: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, 
bool)+0x1fa) [0x558323c24a1a]
  19: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> 
const&)+0x5e) [0x558323c254fe]
  20: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1d6) 
[0x558323bfd906]
  21: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> 
const&)+0x460) [0x7f34627854e0]
  22: (DispatchQueue::entry()+0x58f) [0x7f3462782d7f]
  23: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f346284eee1]
  24: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3462278609]
  25: clone()
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

--- logging levels ---
    0/ 5 none
    0/ 1 lockdep
    0/ 1 context
    1/ 1 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 1 buffer
    0/ 1 timer
    0/ 1 filer
    0/ 1 striper
    0/ 1 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 5 rbd_pwl
    0/ 5 journaler
    0/ 5 objectcacher
    0/ 5 immutable_obj_cache
    0/ 5 client
    1/ 5 osd
    0/ 5 optracker
    0/ 5 objclass
    1/ 3 filestore
    1/ 3 journal
    0/ 0 ms
    1/ 5 mon
    0/10 monc
    1/ 5 paxos
    0/ 5 tp
    1/ 5 auth
    1/ 5 crypto
    1/ 1 finisher
    1/ 1 reserver
    1/ 5 heartbeatmap
    1/ 5 perfcounter
    1/ 5 rgw
    1/ 5 rgw_sync
    1/10 civetweb
    1/ 5 javaclient
    1/ 5 asok
    1/ 1 throttle
    0/ 0 refs
    1/ 5 compressor
    1/ 5 bluestore
    1/ 5 bluefs
    1/ 3 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    4/ 5 memdb
    1/ 5 fuse
    2/ 5 mgr
    1/ 5 mgrc
    1/ 5 dpdk
    1/ 5 eventtrace
    1/ 5 prioritycache
    0/ 5 test
    0/ 5 cephfs_mirror
    0/ 5 cephsqlite
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
   139862749464320 /
   139862757857024 / md_submit
   139862766249728 /
   139862774642432 / MR_Finisher
   139862791427840 / PQ_Finisher
   139862799820544 / mds_rank_progr
   139862808213248 / ms_dispatch
   139862841784064 / ceph-mds
   139862858569472 / safe_timer
   139862875354880 / ms_dispatch
   139862892140288 / io_context_pool
   139862908925696 / admin_socket
   139862917318400 / msgr-worker-2
   139862925711104 / msgr-worker-1
   139862934103808 / msgr-worker-0
   139862951257984 / ceph-mds
   max_recent     10000
   max_new        10000
   log_file /var/log/ceph/floki-mds.icadmin006.log
--- end dump of recent events ---

```

How could I troubleshoot that further?

Thanks in advance for your help,

Emmanuel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io