Hi,
Quoting Stefan Kooman ([email protected]):
> > please apply following patch, thanks.
> >
> > diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc
> > index c0f72d581d..2ca737470d 100644
> > --- a/src/mds/OpenFileTable.cc
> > +++ b/src/mds/OpenFileTable.cc
> > @@ -470,7 +470,11 @@ void OpenFileTable::commit(MDSInternalContextBase *c,
> > uint64_t log_seq, int op_p
> > }
> > if (omap_idx < 0) {
> > ++omap_num_objs;
> > - assert(omap_num_objs <= MAX_OBJECTS);
> > + if (omap_num_objs > MAX_OBJECTS) {
> > + dout(1) << "omap_num_objs " << omap_num_objs << dendl;
> > + dout(1) << "anchor_map size " << anchor_map.size() << dendl;
> > + assert(omap_num_objs <= MAX_OBJECTS);
> > + }
> > omap_num_items.resize(omap_num_objs);
> > omap_updates.resize(omap_num_objs);
> > omap_updates.back().clear = true;
>
> It took a while but an MDS server with this debug patch is now live (and
> up:active).
... and it crashed again (and again), until we stopped the MDS and
deleted the mds0_openfiles.0 object from the metadata pool.
Here is the (debug) output:
2019-12-04 06:25:01.578 7f6200248700 -1 received signal: Hangup from pkill -1
-x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 3491) UID: 0
2019-12-04 20:19:58.043 7f61fc859700 0 mds.0.openfiles omap_num_objs 1025
2019-12-04 20:19:58.043 7f61fc859700 0 mds.0.openfiles anchor_map size 4417650
2019-12-04 20:19:58.043 7f61fc859700 -1
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void
OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread
7f61fc859700 time 2019-12-04 20:19:58.045875
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 476: FAILED assert(omap_num_objs
<= MAX_OBJECTS)
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14e) [0x7f6207d01b5e]
2: (()+0x2c4cb7) [0x7f6207d01cb7]
3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1c5f)
[0x55e38662566f]
4: (MDLog::trim(int)+0x5a6) [0x55e386614666]
5: (MDSRankDispatcher::tick()+0x24b) [0x55e3863a637b]
6: (FunctionContext::finish(int)+0x2c) [0x55e38638b51c]
7: (Context::complete(int)+0x9) [0x55e3863894b9]
8: (SafeTimer::timer_thread()+0xf9) [0x7f6207cfe329]
9: (SafeTimerThread::entry()+0xd) [0x7f6207cffa3d]
10: (()+0x76db) [0x7f62075b56db]
11: (clone()+0x3f) [0x7f620679b88f]
2019-12-04 20:19:58.043 7f61fc859700 -1 *** Caught signal (Aborted) **
in thread 7f61fc859700 thread_name:safe_timer
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0x12890) [0x7f62075c0890]
2: (gsignal()+0xc7) [0x7f62066b8e97]
3: (abort()+0x141) [0x7f62066ba801]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x25f) [0x7f6207d01c6f]
5: (()+0x2c4cb7) [0x7f6207d01cb7]
6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1c5f)
[0x55e38662566f]
7: (MDLog::trim(int)+0x5a6) [0x55e386614666]
8: (MDSRankDispatcher::tick()+0x24b) [0x55e3863a637b]
9: (FunctionContext::finish(int)+0x2c) [0x55e38638b51c]
10: (Context::complete(int)+0x9) [0x55e3863894b9]
11: (SafeTimer::timer_thread()+0xf9) [0x7f6207cfe329]
12: (SafeTimerThread::entry()+0xd) [0x7f6207cffa3d]
13: (()+0x76db) [0x7f62075b56db]
14: (clone()+0x3f) [0x7f620679b88f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
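For anyone reading along without the Ceph tree at hand, here is a minimal
standalone sketch of what the debug patch does: print the two counters only
when the limit is exceeded, right before the assert would fire. The names
diagnose_overflow and kAssumedMaxObjects are hypothetical, not Ceph code. The
limit of 1024 is an assumption inferred from the log above (the assert failed
at omap_num_objs 1025); the real MAX_OBJECTS value is not quoted in this
thread.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// ASSUMPTION: the assert fired at omap_num_objs == 1025, which is consistent
// with a limit of 1024. The actual MAX_OBJECTS constant lives in the Ceph
// source and is not shown in this thread.
constexpr std::size_t kAssumedMaxObjects = 1024;

// Hypothetical helper, not part of Ceph. Returns the diagnostic text the
// debug patch would log before asserting, or an empty string while the
// object count is still within the limit.
std::string diagnose_overflow(std::size_t omap_num_objs,
                              std::size_t anchor_map_size,
                              std::size_t max_objects = kAssumedMaxObjects) {
  if (omap_num_objs <= max_objects) {
    return {};  // within the limit: nothing to log, assert would pass
  }
  std::ostringstream out;
  out << "omap_num_objs " << omap_num_objs << '\n'
      << "anchor_map size " << anchor_map_size << '\n';
  return out.str();
}
```

Calling diagnose_overflow(1025, 4417650) yields the same two values that show
up in the log lines above, so the sketch only mirrors the reporting logic; it
says nothing about why the object count grew past the limit.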
A specific workload that *might* have triggered this: recursively deleting a
long list of files and directories (~7 million in total) with 5 "rm"
processes in parallel.
Gr. Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / [email protected]