Hi Dan,

I think I need to be a bit more precise. When I do the following (mimic 
13.2.10, latest):

# ceph config dump | grep mds_recall_max_decay_rate
# [no output]
# ceph config get mds.0 mds_recall_max_decay_rate
2.500000
# ceph config set mds mds_recall_max_decay_rate 2.5
# 

the MDS cluster immediately becomes unresponsive. Worse yet, newly spawned MDS 
daemons also get stuck and are marked down after the beacon time-out. Clearly, 
having the *same* value either as the compiled-in default or explicitly present 
in the config database leads to different behaviour. How is this possible unless 
it's a bug or the explicit setting executes a different code path? The expected 
behaviour clearly is: nothing happens. This is independent of the current load.
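
For reference, this is how I distinguish "only the compiled-in default applies" 
from "explicitly set to the default value in the config database", and how I 
remove the explicit setting again (ceph-23 is simply the current rank 0 daemon 
from the status below; the first command has to run on the MDS host itself):

# ceph daemon mds.ceph-23 config show | grep mds_recall_max_decay_rate    (value the running daemon actually uses, via its admin socket)
# ceph config rm mds mds_recall_max_decay_rate                            (remove the explicit setting from the config database again)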

The ceph fs status is currently

# ceph fs status
con-fs2 - 1642 clients
=======
+------+--------+---------+---------------+-------+-------+
| Rank | State  |   MDS   |    Activity   |  dns  |  inos |
+------+--------+---------+---------------+-------+-------+
|  0   | active | ceph-23 | Reqs:  434 /s | 2354k | 2266k |
|  1   | active | ceph-12 | Reqs:    6 /s | 3036k | 2960k |
|  2   | active | ceph-08 | Reqs:  513 /s | 1751k | 1613k |
|  3   | active | ceph-15 | Reqs:  523 /s | 1460k | 1365k |
+------+--------+---------+---------------+-------+-------+
+---------------------+----------+-------+-------+
|         Pool        |   type   |  used | avail |
+---------------------+----------+-------+-------+
|    con-fs2-meta1    | metadata | 1372M | 1356G |
|    con-fs2-meta2    |   data   |    0  | 1356G |
|     con-fs2-data    |   data   | 1361T | 6035T |
| con-fs2-data-ec-ssd |   data   |  239G | 4340G |
|    con-fs2-data2    |   data   | 23.6T | 5487T |
+---------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ceph-16   |
|   ceph-14   |
|   ceph-13   |
|   ceph-17   |
|   ceph-10   |
|   ceph-24   |
|   ceph-09   |
|   ceph-11   |
+-------------+
MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) 
mimic (stable)

The MDS activity seems to have improved a bit, but it is still way below the 
averages I saw before trying the cache trimming settings. I would usually have 
2 MDSes with an average activity of 2000-4000 requests per second, with peaks 
at 10K and higher (the highest I have seen was 18K), and 2 MDSes a bit less 
busy. All this with exactly the same IO pattern from the clients; nothing 
changed on the client side during my attempts to set the cache trimming values.
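
(These numbers come from nothing more than watching the Reqs column of 
'ceph fs status' over time; something like

# for i in $(seq 30); do ceph fs status | grep 'Reqs:'; sleep 10; done

gives a quick impression of the averages and the peaks.)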

I wasn't implying a relation between snap trimming and caps recall. What I said 
is that after the change and rollback of the cache trimming parameters, 
snapshot trimming on (at least one of) the fs data pools seems to have 
stopped - i.e. something within ceph stopped working properly as fallout of 
the parameter changes and the cluster has not yet recovered by itself.
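
What I use to judge whether snap trimming is progressing at all are the 
removed_snaps intervals per pool (quoted further down) and the number of PGs 
currently in a snaptrim or snaptrim_wait state; both are standard commands as 
far as I know:

# ceph osd pool ls detail | grep removed_snaps
# ceph pg dump pgs_brief | grep -c snaptrim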

Snapshots themselves cause an extreme performance drop. There seems to be a bug 
in the kernel client that makes it spin in ceph_update_snap_trace, and within 
that mostly in sort. Here is a perf record of the critical section:

+   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] ret_from_fork_nospec_begin
+   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] kthread
+   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] worker_thread
+   99.32%     0.00%  kworker/0:2   [kernel.kallsyms]    [k] process_one_work
+   99.31%     0.00%  kworker/0:2   [libceph]            [k] ceph_con_workfn
+   99.30%     0.00%  kworker/0:2   [libceph]            [k] try_read
+   99.27%     0.00%  kworker/0:2   [ceph]               [k] dispatch
+   99.26%     0.00%  kworker/0:2   [ceph]               [k] handle_reply
-   98.94%     0.06%  kworker/0:2   [ceph]               [k] ceph_update_snap_trace
   - 98.88% ceph_update_snap_trace
      - 90.03% rebuild_snap_realms
         - 90.01% rebuild_snap_realms
            - 89.54% build_snap_context
               + 36.11% sort
                 15.59% __x86_indirect_thunk_rax
                 14.64% cmpu64_rev
                 13.03% __x86_indirect_thunk_r13
                 3.84% generic_swap
                 0.64% ceph_create_snap_context
        3.51% _raw_qspin_lock
        2.47% __list_del_entry
        1.36% ceph_queue_cap_snap
        0.51% __ceph_caps_used
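
(A profile like the above can be captured with something along the lines of

# perf record -a -g -- sleep 30
# perf report

on the client node while the slow workload is running, so it should be easy to 
reproduce elsewhere.)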

I'm pretty sure it's spinning over the exact same data over and over again, 
based on the following observation: if I make a fresh mount, the client 
initially performs very well. It starts slowing down dramatically as the cache 
fills up. This has also been reported in other threads:

https://tracker.ceph.com/issues/44100
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ELWPK3QGARFTVAFULFOUDOTLUGIL4HLP/

I cannot see how it is not a bug when operations are fast with an empty cache 
and slow with a full one. This issue is present in the latest stable kernels; 
I currently test with 5.9.9-1.el7.elrepo.x86_64.
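
To illustrate what I mean by "no cache fast, full cache slow" (mount options 
and paths below are placeholders, not our exact setup): after a fresh mount a 
metadata-heavy traversal is quick, and repeating the same traversal gets 
dramatically slower once the client has filled its cache:

# umount /mnt/cephfs; mount -t ceph MON-HOST:/ /mnt/cephfs -o name=CLIENT,secretfile=/etc/ceph/client.secret
# for i in 1 2 3; do time ls -lR /mnt/cephfs/some/large/dir > /dev/null; done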

This is why I am now so concerned that the cache trimming parameter change 
caused some internal degradation that, in turn, leads to snapshots piling 
up and killing performance completely. It would be very helpful to know how 
ceph fs handles snapshots and how I can confirm that either everything 
functions as expected, or I have a problem. I'm afraid that having fs data 
pools with inconsistent snapshot counts points to a severe degradation.
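
(What I can check myself, with the mount point as a placeholder, is the list 
of snapshots still visible under the .snap directory of the snapshotted path, 
e.g.

# ls /mnt/cephfs/.snap | wc -l

and compare that against the removed_snaps intervals per data pool quoted 
below. But I don't know how to tell from that whether the trimming/purging 
machinery itself is healthy, hence the question.)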

Maybe you could point one of the ceph fs devs to this problem?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <d...@vanderster.com>
Sent: 06 September 2021 11:33
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] Re: MDS daemons stuck in resolve, please help

Hi Frank,

That's unfortunate! Most of those options relax warnings and relax
when a client is considered having too many caps.
The option mds_recall_max_caps might be CPU intensive -- the MDS would
be busy recalling caps if indeed you have clients which are hammering
the MDSs with metadata workloads.
What is your current `ceph fs status` output? If you have very active
users, perhaps you can ask them to temporarily slow down and see the
impact on your cluster?

I'm not aware of any relation between caps recall and snap trimming.
We don't use snapshots (until now some pacific tests) so I can't say
if that is relevant to this issue.

-- dan




On Mon, Sep 6, 2021 at 11:18 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan,
>
> unfortunately, setting these parameters crashed the MDS cluster and we now 
> have severe performance issues. Particularly bad is mds_recall_max_decay_rate. 
> Even just setting it to the default value immediately makes all MDS daemons 
> unresponsive, and they get failed by the MONs. I already set the mds beacon 
> time-out to 10 minutes to avoid MDS daemons getting marked down too early 
> when they need to trim a large (oversized) cache. The formerly active, then 
> failed, daemons never recover; I have to restart them manually to get them 
> back as stand-bys.
>
> We are running mimic-13.2.10. Does explicitly setting 
> mds_recall_max_decay_rate enable a different code path in this version?
>
> I tried to fix the situation by removing all modified config pars (ceph 
> config rm ...) again and doing a full restart of all daemons, first all 
> stand-bys and then the active ones one by one. Unfortunately, this did not 
> help. In addition, it looks like one of our fs data pools does not purge 
> snapshots any more:
>
> pool 12 'con-fs2-meta1' no removed_snaps list shown
> pool 13 'con-fs2-meta2' removed_snaps 
> [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> pool 14 'con-fs2-data' removed_snaps 
> [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> pool 17 'con-fs2-data-ec-ssd' removed_snaps 
> [2~18e,191~2c,1be~144,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
> pool 19 'con-fs2-data2' removed_snaps 
> [2d6~1,2d8~1,2da~1,2dc~1,2de~1,2e0~1,2e2~1,2e4~1,2e6~1,2e8~1,2ea~18,303~1,305~1,307~1,309~1,30b~1,30d~1,30f~1,311~1,313~1,315~1]
>
> con-fs2-meta2 is the primary data pool. It does not store actual file data; 
> we have con-fs2-data2 set as the data pool on the fs root. It's the new 
> recommended 3-pool layout, with the meta-data pool and the primary data pool 
> storing meta-data only.
>
> The MDS daemons report 12 snapshots and if I interpret the removed_snaps info 
> correctly, the pools con-fs2-meta2, con-fs2-data and con-fs2-data-ec-ssd 
> store 12 snapshots. However, pool con-fs2-data2 has at least 20. We use 
> rolling snapshots and it looks like the snapshots are not purged any more 
> since I tried setting the MDS trimming parameters. This, in turn, is 
> potentially a reason for the performance degradation we experience at the 
> moment.
>
> I would be most grateful if you could provide some pointers as to what to 
> look for with regard to why snapshots don't disappear and/or what might have 
> happened to our MDS daemons performance-wise.
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> [... truncated]
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
