We have run into what looks like bug 36094 (https://tracker.ceph.com/issues/36094) on our 13.2.6 cluster, and unfortunately one of our ranks (rank 1) now won't start: it comes up for a few seconds before the assigned MDS crashes again with the log entries below. It appears that the OpenFileTable has somehow become corrupted, but it is not clear from the Ceph tool documentation whether there is any way of clearing it.

Before we resort to deleting and recreating the cluster, are there any further recovery steps we can perform?

Thanks.

2019-08-27 16:10:50.775 7f2c94581700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f2c94581700 time 2019-08-27 16:10:50.774858
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f2ca064636b]
 2: (()+0x26e4f7) [0x7f2ca06464f7]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b35) [0x557afbe49265]
 4: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
 5: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
 6: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
 7: (Context::complete(int)+0x9) [0x557afbbb0ef9]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
 9: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
 10: (()+0x7dd5) [0x7f2c9e284dd5]
 11: (clone()+0x6d) [0x7f2c9d36202d]

2019-08-27 16:10:50.777 7f2c94581700 -1 *** Caught signal (Aborted) **
 in thread 7f2c94581700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0xf5d0) [0x7f2c9e28c5d0]
 2: (gsignal()+0x37) [0x7f2c9d29a2c7]
 3: (abort()+0x148) [0x7f2c9d29b9b8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x248) [0x7f2ca0646468]
 5: (()+0x26e4f7) [0x7f2ca06464f7]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b35) [0x557afbe49265]
 7: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
 8: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
 9: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
 10: (Context::complete(int)+0x9) [0x557afbbb0ef9]
 11: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
 12: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
 13: (()+0x7dd5) [0x7f2c9e284dd5]
 14: (clone()+0x6d) [0x7f2c9d36202d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
