Dear Dan, thanks for the fast reply!
> ... when you set mds_recall_max_decay_rate there is a side effect that all
> session recall_caps_throttle's are re-initialized

OK, something like this could be a problem with the number of clients we have. I guess next time I'll wait for a service window and try it out without load on it. Or upgrade first :)

> ... I've cc'd Patrick.

Thanks a lot! It would be really good if we could resolve the mystery of the extra snapshots in pool con-fs2-data2.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <d...@vanderster.com>
Sent: 07 September 2021 14:20:58
To: Frank Schilder; Patrick Donnelly
Cc: ceph-users
Subject: Re: [ceph-users] Re: MDS daemons stuck in resolve, please help

Hi,

On Tue, Sep 7, 2021 at 1:55 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan,
>
> I think I need to be a bit more precise. When I do the following (mimic 13.2.10, latest):
>
> # ceph config dump | grep mds_recall_max_decay_rate
> # [no output]
> # ceph config get mds.0 mds_recall_max_decay_rate
> 2.500000
> # ceph config set mds mds_recall_max_decay_rate 2.5
> #
>
> the MDS cluster immediately becomes unresponsive. Worse yet, newly spawned MDS daemons also get stuck and are marked down after the beacon time-out. Clearly, having the *same* value either as default or explicitly present in the config database leads to different behaviour. How is this possible unless it's a bug or it leads to execution of different code paths? The expected behaviour clearly is: nothing happens. This is independent of current load.

The code is here:
https://github.com/ceph/ceph/blob/mimic/src/mds/SessionMap.cc#L1050

when you set mds_recall_max_decay_rate there is a side effect that all session recall_caps_throttle's are re-initialized. I don't understand why, but this clearly causes your MDS to stall. I assume this should be a one-off -- once things recover and you have a good recall config for your workloads, there shouldn't be any more stalling/side effects.

It is of course possible that mimic is missing other fixes, that we have in nautilus++, that permit the config I discuss. I quickly checked and 13.2.10 has the code to `recall caps incrementally`, so it doesn't look obviously broken to me. But you might want to study those differences to see what mimic lacks in terms of known recall improvements.
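For illustration, here is a rough Python model of the side effect described above. It is not the actual SessionMap.cc code -- the class and field names are only borrowed -- it just sketches the point that explicitly setting the option fires a config-change handler which rebuilds every session's recall throttle, even when the new value equals the compiled-in default:

# Simplified model, NOT the real Ceph C++ code: "set the option ->
# every per-session recall throttle is re-initialized".

class DecayCounter:
    """Toy stand-in for Ceph's DecayCounter: a value that decays over time."""
    def __init__(self, rate):
        self.rate = rate
        self.value = 0.0   # accumulated (decayed) recall count; decay itself omitted here

class Session:
    def __init__(self, rate):
        self.recall_caps_throttle = DecayCounter(rate)

class SessionMap:
    def __init__(self, sessions):
        self.sessions = sessions

    def handle_conf_change(self, changed, new_rate):
        # Runs whenever the option is *set*, even if new_rate equals the
        # current default, and replaces every session's throttle,
        # dropping whatever state had accumulated.
        if "mds_recall_max_decay_rate" in changed:
            for s in self.sessions:
                s.recall_caps_throttle = DecayCounter(new_rate)

# With ~1642 clients, one config set touches every session at once:
sm = SessionMap([Session(2.5) for _ in range(1642)])
sm.handle_conf_change({"mds_recall_max_decay_rate"}, 2.5)

Why dropping that accumulated state stalls the MDS under load is exactly the open question, but the model shows why "setting the default explicitly" is not a no-op.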
> The ceph fs status is currently
>
> # ceph fs status
> con-fs2 - 1642 clients
> =======
> +------+--------+---------+---------------+-------+-------+
> | Rank | State  |   MDS   |    Activity   |  dns  |  inos |
> +------+--------+---------+---------------+-------+-------+
> |  0   | active | ceph-23 | Reqs:  434 /s | 2354k | 2266k |
> |  1   | active | ceph-12 | Reqs:    6 /s | 3036k | 2960k |
> |  2   | active | ceph-08 | Reqs:  513 /s | 1751k | 1613k |
> |  3   | active | ceph-15 | Reqs:  523 /s | 1460k | 1365k |
> +------+--------+---------+---------------+-------+-------+

OK those reqs look reasonable -- our MDSs have stable activity above 1000 or 2000 reqs/sec. Here's an example from our cluster right now:

+------+--------+----------------------+---------------+-------+-------+
| Rank | State  |         MDS          |    Activity   |  dns  |  inos |
+------+--------+----------------------+---------------+-------+-------+
|  0   | active | cephcpu21-46bb400fc8 | Reqs: 1550 /s | 44.5M | 18.5M |
|  1   | active | cephcpu21-0c370531cf | Reqs: 1509 /s | 14.2M | 14.1M |
|  2   | active | cephcpu21-4a93514bf3 | Reqs:  686 /s | 11.2M | 11.2M |
+------+--------+----------------------+---------------+-------+-------+

> +---------------------+----------+-------+-------+
> |         Pool        |   type   |  used | avail |
> +---------------------+----------+-------+-------+
> |    con-fs2-meta1    | metadata | 1372M | 1356G |
> |    con-fs2-meta2    |   data   |    0  | 1356G |
> |     con-fs2-data    |   data   | 1361T | 6035T |
> | con-fs2-data-ec-ssd |   data   |  239G | 4340G |
> |    con-fs2-data2    |   data   | 23.6T | 5487T |
> +---------------------+----------+-------+-------+
> +-------------+
> | Standby MDS |
> +-------------+
> |   ceph-16   |
> |   ceph-14   |
> |   ceph-13   |
> |   ceph-17   |
> |   ceph-10   |
> |   ceph-24   |
> |   ceph-09   |
> |   ceph-11   |
> +-------------+
> MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
>
> It seems like it improved a bit, but it is still way below the averages I saw before trying the cache trimming settings. I would usually have 2 MDSes with an average activity of 2000-4000 requests per second with peaks at 10K and higher (the highest I have seen was 18K), and 2 MDSes a bit less busy. All this with exactly the same IO pattern from clients; nothing changed on the client side during my attempts to set the cache trimming values.
>
> I wasn't implying a relation between snap trimming and caps recall. What I said is that after the change and roll-back of the cache trimming parameters, it looks like the snapshot trimming on (one of) the fs data pools seems to have stopped - i.e. something within ceph stopped working properly as a fall-out of the parameter changes and the cluster did not recover by itself yet.
>
> Snapshots themselves cause an extreme performance drop. There seems to be a bug in the kernel client that makes it spin like crazy over ceph_update_snap_trace, and within that over sort. Here is a perf record of the critical section:
>
> +   99.32%  0.00%  kworker/0:2  [kernel.kallsyms]  [k] ret_from_fork_nospec_begin
> +   99.32%  0.00%  kworker/0:2  [kernel.kallsyms]  [k] kthread
> +   99.32%  0.00%  kworker/0:2  [kernel.kallsyms]  [k] worker_thread
> +   99.32%  0.00%  kworker/0:2  [kernel.kallsyms]  [k] process_one_work
> +   99.31%  0.00%  kworker/0:2  [libceph]          [k] ceph_con_workfn
> +   99.30%  0.00%  kworker/0:2  [libceph]          [k] try_read
> +   99.27%  0.00%  kworker/0:2  [ceph]             [k] dispatch
> +   99.26%  0.00%  kworker/0:2  [ceph]             [k] handle_reply
> -   98.94%  0.06%  kworker/0:2  [ceph]             [k] ceph_update_snap_trace
>    - 98.88% ceph_update_snap_trace
>       - 90.03% rebuild_snap_realms
>          - 90.01% rebuild_snap_realms
>             - 89.54% build_snap_context
>                + 36.11% sort
>                  15.59% __x86_indirect_thunk_rax
>                  14.64% cmpu64_rev
>                  13.03% __x86_indirect_thunk_r13
>                   3.84% generic_swap
>                   0.64% ceph_create_snap_context
>               3.51% _raw_qspin_lock
>               2.47% __list_del_entry
>               1.36% ceph_queue_cap_snap
>               0.51% __ceph_caps_used
>
> I'm pretty sure it's spinning over the exact same data over and over again because of the following observation: if I make a fresh mount, the client actually performs with high performance initially. It starts slowing down dramatically as the cache fills up. This has also been reported in other threads:
>
> https://tracker.ceph.com/issues/44100
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ELWPK3QGARFTVAFULFOUDOTLUGIL4HLP/
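To make the shape of that hot path concrete, here is a rough Python model of what the trace above suggests, not the actual kernel client code (which lives in fs/ceph/snap.c): on each snap trace update, rebuild_snap_realms walks the realm tree and build_snap_context re-merges and re-sorts the snap ids for every affected realm, so the same lists get sorted again and again as more realms sit in the client cache.

# Simplified model of the recursion visible in the perf trace above.

def build_snap_context(realm, parent_snaps):
    # merge this realm's own snaps with everything inherited from the parent...
    snaps = list(realm["snaps"]) + list(parent_snaps)
    # ...and sort them (the sort()/cmpu64_rev samples in the trace)
    snaps.sort(reverse=True)
    realm["cached_context"] = snaps
    return snaps

def rebuild_snap_realms(realm, parent_snaps=()):
    ctx = build_snap_context(realm, parent_snaps)
    for child in realm["children"]:
        rebuild_snap_realms(child, ctx)

# A freshly mounted client has almost no realms cached, so an update is
# cheap; once many directories/realms are cached, every update repeats
# the whole walk -- consistent with "fast at first, slow as the cache fills".
root = {"snaps": [0x303, 0x305, 0x307],
        "children": [{"snaps": [0x309 + i], "children": []} for i in range(10000)]}
rebuild_snap_realms(root)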
> I cannot see how it is not a bug that operations with no cache are fast and with cache are slow. This issue is present in the latest stable kernels; currently I test with 5.9.9-1.el7.elrepo.x86_64.
>
> This is why I am so concerned now that the cache trimming parameter change caused some internal degradation that, in turn, now leads to snapshots piling up and killing performance completely. It would be very helpful to know how ceph fs handles snapshots and how I can confirm that either everything functions as expected, or I have a problem. I'm afraid having fs data pools with inconsistent snapshot counts points to a severe degradation.
>
> Maybe you could point one of the ceph fs devs to this problem?

Yeah I certainly can't add more to clarify what you asked above; I simply don't know the snapshot code enough to speculate what might be going wrong here. I've cc'd Patrick.

Cheers, dan

>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <d...@vanderster.com>
> Sent: 06 September 2021 11:33
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] Re: MDS daemons stuck in resolve, please help
>
> Hi Frank,
>
> That's unfortunate! Most of those options relax warnings and relax when a client is considered to have too many caps.
> The option mds_recall_max_caps might be CPU intensive -- the MDS would be busy recalling caps if indeed you have clients which are hammering the MDSs with metadata workloads.
> What is your current `ceph fs status` output? If you have very active users, perhaps you can ask them to temporarily slow down and see the impact on your cluster?
>
> I'm not aware of any relation between caps recall and snap trimming. We don't use snapshots (until now some pacific tests) so I can't say if that is relevant to this issue.
>
> -- dan
>
>
>
>
> On Mon, Sep 6, 2021 at 11:18 AM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi Dan,
> >
> > unfortunately, setting these parameters crashed the MDS cluster and we now have severe performance issues. Particularly bad is mds_recall_max_decay_rate. Even just setting it to the default value immediately makes all MDS daemons unresponsive, and they get failed by the MONs. I already set the mds beacon time-out to 10 minutes to avoid MDS daemons getting marked down too early when they need to trim a large (oversized) cache. The formerly active, then failed daemons never recover; I have to restart them manually to get them back as stand-bys.
> >
> > We are running mimic 13.2.10. Does explicitly setting mds_recall_max_decay_rate enable a different code path in this version?
> >
> > I tried to fix the situation by removing all modified config parameters (ceph config rm ...) again and doing a full restart of all daemons, first all stand-bys and then the active ones one by one. Unfortunately, this did not help. In addition, it looks like one of our fs data pools does not purge snapshots any more:
> >
> > pool 12 'con-fs2-meta1' no removed_snaps list shown
> > pool 13 'con-fs2-meta2' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 14 'con-fs2-data' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 17 'con-fs2-data-ec-ssd' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 19 'con-fs2-data2' removed_snaps [2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~18,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> >
> > con-fs2-meta2 is the primary data pool. It does not store actual file data; we have con-fs2-data2 set as the data pool on the fs root. It's the new recommended 3-pool layout, with the meta-data pool and the primary data pool storing meta-data only.
> >
> > The MDS daemons report 12 snapshots, and if I interpret the removed_snaps info correctly, the pools con-fs2-meta2, con-fs2-data and con-fs2-data-ec-ssd store 12 snapshots. However, pool con-fs2-data2 has at least 20. We use rolling snapshots, and it looks like the snapshots are not purged any more since I tried setting the MDS trimming parameters. This, in turn, is potentially a reason for the performance degradation we experience at the moment.
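One rough way to sanity-check that reading of the removed_snaps sets is to parse them and diff the pools. The sketch below assumes the printed intervals are "start~length" in hexadecimal, and that a snap id which appears in the removed set of the other data pools but not in that of con-fs2-data2 is a snapshot con-fs2-data2 has not purged yet:

def parse_removed_snaps(text):
    """Parse '[2~18e,191~2c,...]' into a set of snap ids."""
    ids = set()
    for interval in text.strip("[]").split(","):
        start, length = (int(x, 16) for x in interval.split("~"))
        ids.update(range(start, start + length))
    return ids

others = parse_removed_snaps(
    "[2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,"
    "30b~1,30d~1,30f~1,311~1,313~1,315~1]")
data2 = parse_removed_snaps(
    "[2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,"
    "2ea~18,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]")

# Only look at ids from the point where con-fs2-data2's history starts,
# since older snaps presumably predate that pool being added to the fs.
floor = min(data2)
stuck = sorted(i for i in others - data2 if i >= floor)
print(len(stuck), [hex(i) for i in stuck])

If that reading is right, the diff comes out to the ten odd ids 0x2d7 through 0x2e9, which the other pools have already dropped; together with the roughly 12 current snapshots, that is consistent with the "at least 20" count above.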
> > I would be most grateful if you could provide some pointers as to what to look for with regard to why snapshots don't disappear, and/or what might have happened to our MDS daemons performance-wise.
> >
> > Thanks and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > [... truncated]

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io