Dear Dan, thanks for the fast reply!
> ... when you set mds_recall_max_decay_rate there is a side effect that all
> session recall_caps_throttle's are re-initialized

OK, something like this could be a problem with the number of clients we have. I guess next time I'll wait for a service window and try it out without load on it. Or upgrade first :)

> ... I've cc'd Patrick.

Thanks a lot! It would be really good if we could resolve the mystery of the extra snapshots in pool con-fs2-data2.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <d...@vanderster.com>
Sent: 07 September 2021 14:20:58
To: Frank Schilder; Patrick Donnelly
Cc: ceph-users
Subject: Re: [ceph-users] Re: MDS daemons stuck in resolve, please help

Hi,

On Tue, Sep 7, 2021 at 1:55 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan,
>
> I think I need to be a bit more precise. When I do the following (mimic 13.2.10, latest):
>
> # ceph config dump | grep mds_recall_max_decay_rate
> # [no output]
> # ceph config get mds.0 mds_recall_max_decay_rate
> 2.500000
> # ceph config set mds mds_recall_max_decay_rate 2.5
> #
>
> the MDS cluster immediately becomes unresponsive. Worse yet, newly spawned MDS daemons also get stuck and are marked down after the beacon time-out. Clearly, having the *same* value either as default or explicitly present in the config database leads to different behaviour. How is this possible unless it's a bug or it leads to execution of different code paths? The expected behaviour clearly is: nothing happens. This is independent of current load.

The code is here:
https://github.com/ceph/ceph/blob/mimic/src/mds/SessionMap.cc#L1050

when you set mds_recall_max_decay_rate there is a side effect that all session recall_caps_throttle's are re-initialized. I don't understand why, but this clearly causes your MDS to stall. I assume this should be a one-off -- once things recover and you have a good recall config for your workloads, there shouldn't be any more stalling/side effects.

It is of course possible that mimic is missing other fixes, that we have in nautilus++, that permit the config I discuss. I quickly checked and 13.2.10 has the code to `recall caps incrementally`, so it doesn't look obviously broken to me. But you might want to study those differences to see what mimic lacks in terms of known recall improvements.
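For illustration, here is a rough Python model of the side effect described above. It is not the actual SessionMap.cc code -- the class and field names are only borrowed -- it just sketches the point that explicitly setting the option fires a config-change handler which rebuilds every session's recall throttle, even when the new value equals the compiled-in default:

# Simplified model, NOT the real Ceph C++ code: "set the option ->
# every per-session recall throttle is re-initialized".

class DecayCounter:
    """Toy stand-in for Ceph's DecayCounter: a value that decays over time."""
    def __init__(self, rate):
        self.rate = rate
        self.value = 0.0   # accumulated (decayed) recall count; decay itself omitted here

class Session:
    def __init__(self, rate):
        self.recall_caps_throttle = DecayCounter(rate)

class SessionMap:
    def __init__(self, sessions):
        self.sessions = sessions

    def handle_conf_change(self, changed, new_rate):
        # Runs whenever the option is *set*, even if new_rate equals the
        # current default, and replaces every session's throttle,
        # dropping whatever state had accumulated.
        if "mds_recall_max_decay_rate" in changed:
            for s in self.sessions:
                s.recall_caps_throttle = DecayCounter(new_rate)

# With ~1642 clients, one config set touches every session at once:
sm = SessionMap([Session(2.5) for _ in range(1642)])
sm.handle_conf_change({"mds_recall_max_decay_rate"}, 2.5)

Why dropping that accumulated state stalls the MDS under load is exactly the open question, but the model shows why "setting the default explicitly" is not a no-op.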
> The ceph fs status is currently
>
> # ceph fs status
> con-fs2 - 1642 clients
> =======
> +------+--------+---------+---------------+-------+-------+
> | Rank | State  |   MDS   |    Activity   |  dns  |  inos |
> +------+--------+---------+---------------+-------+-------+
> |  0   | active | ceph-23 | Reqs:  434 /s | 2354k | 2266k |
> |  1   | active | ceph-12 | Reqs:    6 /s | 3036k | 2960k |
> |  2   | active | ceph-08 | Reqs:  513 /s | 1751k | 1613k |
> |  3   | active | ceph-15 | Reqs:  523 /s | 1460k | 1365k |
> +------+--------+---------+---------------+-------+-------+

OK those reqs look reasonable -- our MDSs have stable activity above 1000 or 2000 reqs/sec. Here's an example from our cluster right now:

+------+--------+----------------------+---------------+-------+-------+
| Rank | State  |         MDS          |    Activity   |  dns  |  inos |
+------+--------+----------------------+---------------+-------+-------+
|  0   | active | cephcpu21-46bb400fc8 | Reqs: 1550 /s | 44.5M | 18.5M |
|  1   | active | cephcpu21-0c370531cf | Reqs: 1509 /s | 14.2M | 14.1M |
|  2   | active | cephcpu21-4a93514bf3 | Reqs:  686 /s | 11.2M | 11.2M |
+------+--------+----------------------+---------------+-------+-------+

> +---------------------+----------+-------+-------+
> |         Pool        |   type   |  used | avail |
> +---------------------+----------+-------+-------+
> |    con-fs2-meta1    | metadata | 1372M | 1356G |
> |    con-fs2-meta2    |   data   |    0  | 1356G |
> |     con-fs2-data    |   data   | 1361T | 6035T |
> | con-fs2-data-ec-ssd |   data   |  239G | 4340G |
> |    con-fs2-data2    |   data   | 23.6T | 5487T |
> +---------------------+----------+-------+-------+
> +-------------+
> | Standby MDS |
> +-------------+
> |   ceph-16   |
> |   ceph-14   |
> |   ceph-13   |
> |   ceph-17   |
> |   ceph-10   |
> |   ceph-24   |
> |   ceph-09   |
> |   ceph-11   |
> +-------------+
> MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
>
> It seems like it improved a bit, but it is still way below the averages I saw before trying the cache trimming settings. I would usually have 2 MDSes with an average activity of 2000-4000 requests per second with peaks at 10K and higher (the highest I have seen was 18K), and 2 MDSes a bit less busy. All this with exactly the same IO pattern from clients; nothing changed on the client side during my attempts to set the cache trimming values.
>
> I wasn't implying a relation between snap trimming and caps recall. What I said is that after the change and roll-back of the cache trimming parameters, it looks like the snapshot trimming on (one of) the fs data pools seems to have stopped - i.e. something within ceph stopped working properly as a fall-out of the parameter changes and the cluster did not recover by itself yet.
>
> Snapshots themselves cause an extreme performance drop. There seems to be a bug in the kernel client that makes it spin like crazy over ceph_update_snap_trace, and within that over sort. Here is a perf record of the critical section:
>
> +   99.32%  0.00%  kworker/0:2  [kernel.kallsyms]  [k] ret_from_fork_nospec_begin
> +   99.32%  0.00%  kworker/0:2  [kernel.kallsyms]  [k] kthread
> +   99.32%  0.00%  kworker/0:2  [kernel.kallsyms]  [k] worker_thread
> +   99.32%  0.00%  kworker/0:2  [kernel.kallsyms]  [k] process_one_work
> +   99.31%  0.00%  kworker/0:2  [libceph]          [k] ceph_con_workfn
> +   99.30%  0.00%  kworker/0:2  [libceph]          [k] try_read
> +   99.27%  0.00%  kworker/0:2  [ceph]             [k] dispatch
> +   99.26%  0.00%  kworker/0:2  [ceph]             [k] handle_reply
> -   98.94%  0.06%  kworker/0:2  [ceph]             [k] ceph_update_snap_trace
>    - 98.88% ceph_update_snap_trace
>       - 90.03% rebuild_snap_realms
>          - 90.01% rebuild_snap_realms
>             - 89.54% build_snap_context
>                + 36.11% sort
>                  15.59% __x86_indirect_thunk_rax
>                  14.64% cmpu64_rev
>                  13.03% __x86_indirect_thunk_r13
>                   3.84% generic_swap
>                   0.64% ceph_create_snap_context
>               3.51% _raw_qspin_lock
>               2.47% __list_del_entry
>               1.36% ceph_queue_cap_snap
>               0.51% __ceph_caps_used
>
> I'm pretty sure it's spinning over the exact same data over and over again because of the following observation: if I make a fresh mount, the client actually performs with high performance initially. It starts slowing down dramatically as the cache fills up. This has also been reported in other threads:
>
> https://tracker.ceph.com/issues/44100
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ELWPK3QGARFTVAFULFOUDOTLUGIL4HLP/
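To make the shape of that hot path concrete, here is a rough Python model of what the trace above suggests, not the actual kernel client code (which lives in fs/ceph/snap.c): on each snap trace update, rebuild_snap_realms walks the realm tree and build_snap_context re-merges and re-sorts the snap ids for every affected realm, so the same lists get sorted again and again as more realms sit in the client cache.

# Simplified model of the recursion visible in the perf trace above.

def build_snap_context(realm, parent_snaps):
    # merge this realm's own snaps with everything inherited from the parent...
    snaps = list(realm["snaps"]) + list(parent_snaps)
    # ...and sort them (the sort()/cmpu64_rev samples in the trace)
    snaps.sort(reverse=True)
    realm["cached_context"] = snaps
    return snaps

def rebuild_snap_realms(realm, parent_snaps=()):
    ctx = build_snap_context(realm, parent_snaps)
    for child in realm["children"]:
        rebuild_snap_realms(child, ctx)

# A freshly mounted client has almost no realms cached, so an update is
# cheap; once many directories/realms are cached, every update repeats
# the whole walk -- consistent with "fast at first, slow as the cache fills".
root = {"snaps": [0x303, 0x305, 0x307],
        "children": [{"snaps": [0x309 + i], "children": []} for i in range(10000)]}
rebuild_snap_realms(root)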
> I cannot see how it is not a bug that operations with no cache are fast and with cache are slow. This issue is present in the latest stable kernels; currently I test with 5.9.9-1.el7.elrepo.x86_64.
>
> This is why I am so concerned now that the cache trimming parameter change caused some internal degradation that, in turn, now leads to snapshots piling up and killing performance completely. It would be very helpful to know how ceph fs handles snapshots and how I can confirm that either everything functions as expected, or I have a problem. I'm afraid having fs data pools with inconsistent snapshot counts points to a severe degradation.
>
> Maybe you could point one of the ceph fs devs to this problem?

Yeah I certainly can't add more to clarify what you asked above; I simply don't know the snapshot code enough to speculate what might be going wrong here. I've cc'd Patrick.

Cheers, dan

>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <d...@vanderster.com>
> Sent: 06 September 2021 11:33
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] Re: MDS daemons stuck in resolve, please help
>
> Hi Frank,
>
> That's unfortunate! Most of those options relax warnings and relax when a client is considered to have too many caps.
> The option mds_recall_max_caps might be CPU intensive -- the MDS would be busy recalling caps if indeed you have clients which are hammering the MDSs with metadata workloads.
> What is your current `ceph fs status` output? If you have very active users, perhaps you can ask them to temporarily slow down and see the impact on your cluster?
>
> I'm not aware of any relation between caps recall and snap trimming. We don't use snapshots (until now some pacific tests) so I can't say if that is relevant to this issue.
>
> -- dan
>
>
>
>
> On Mon, Sep 6, 2021 at 11:18 AM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi Dan,
> >
> > unfortunately, setting these parameters crashed the MDS cluster and we now have severe performance issues. Particularly bad is mds_recall_max_decay_rate. Even just setting it to the default value immediately makes all MDS daemons unresponsive, and they get failed by the MONs. I already set the mds beacon time-out to 10 minutes to avoid MDS daemons getting marked down too early when they need to trim a large (oversized) cache. The formerly active, then failed daemons never recover; I have to restart them manually to get them back as stand-bys.
> >
> > We are running mimic 13.2.10. Does explicitly setting mds_recall_max_decay_rate enable a different code path in this version?
> >
> > I tried to fix the situation by removing all modified config parameters (ceph config rm ...) again and doing a full restart of all daemons, first all stand-bys and then the active ones one by one. Unfortunately, this did not help. In addition, it looks like one of our fs data pools does not purge snapshots any more:
> >
> > pool 12 'con-fs2-meta1' no removed_snaps list shown
> > pool 13 'con-fs2-meta2' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 14 'con-fs2-data' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 17 'con-fs2-data-ec-ssd' removed_snaps [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> > pool 19 'con-fs2-data2' removed_snaps [2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~18,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> >
> > con-fs2-meta2 is the primary data pool. It does not store actual file data; we have con-fs2-data2 set as the data pool on the fs root. It's the new recommended 3-pool layout, with the meta-data pool and the primary data pool storing meta-data only.
> >
> > The MDS daemons report 12 snapshots, and if I interpret the removed_snaps info correctly, the pools con-fs2-meta2, con-fs2-data and con-fs2-data-ec-ssd store 12 snapshots. However, pool con-fs2-data2 has at least 20. We use rolling snapshots, and it looks like the snapshots are not purged any more since I tried setting the MDS trimming parameters. This, in turn, is potentially a reason for the performance degradation we experience at the moment.
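One rough way to sanity-check that reading of the removed_snaps sets is to parse them and diff the pools. The sketch below assumes the printed intervals are "start~length" in hexadecimal, and that a snap id which appears in the removed set of the other data pools but not in that of con-fs2-data2 is a snapshot con-fs2-data2 has not purged yet:

def parse_removed_snaps(text):
    """Parse '[2~18e,191~2c,...]' into a set of snap ids."""
    ids = set()
    for interval in text.strip("[]").split(","):
        start, length = (int(x, 16) for x in interval.split("~"))
        ids.update(range(start, start + length))
    return ids

others = parse_removed_snaps(
    "[2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,"
    "30b~1,30d~1,30f~1,311~1,313~1,315~1]")
data2 = parse_removed_snaps(
    "[2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,"
    "2ea~18,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]")

# Only look at ids from the point where con-fs2-data2's history starts,
# since older snaps presumably predate that pool being added to the fs.
floor = min(data2)
stuck = sorted(i for i in others - data2 if i >= floor)
print(len(stuck), [hex(i) for i in stuck])

If that reading is right, the diff comes out to the ten odd ids 0x2d7 through 0x2e9, which the other pools have already dropped; together with the roughly 12 current snapshots, that is consistent with the "at least 20" count above.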
> > I would be most grateful if you could provide some pointers as to what to look for with regard to why snapshots don't disappear, and/or what might have happened to our MDS daemons performance-wise.
> >
> > Thanks and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > [... truncated]

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io