[ceph-users] Re: MDS_TRIM 1 MDSs behind on trimming and

2021-04-21 Thread Flemming Frandsen
I'll be damned. I restarted the wedged mds and, after a reasonable amount of time, the standby mds finished replaying and became active. The cluster is now healthy and it seems the apps I have running on top of cephfs have sorted themselves out too. I guess all the MDS really needed was a stern bul
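
A restart like the one described above is issued per daemon; the daemon names below are only examples and depend on how the MDSs were deployed:

    # cephadm-managed cluster (daemon name is an example)
    ceph orch daemon restart mds.cephfs.host1.abcdef
    # or, with packages/systemd on the MDS host:
    systemctl restart ceph-mds@host1

    # then watch the standby replay and go active
    ceph fs status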

[ceph-users] Re: MDS_TRIM 1 MDSs behind on trimming and

2021-04-21 Thread Dan van der Ster
No, no, pinning now won't help anything... I was asking to understand if it's likely there is balancing happening actively now. If you don't pin, then it's likely. Try the debug logs. And check the exports using something like: ceph daemon mds.b get subtrees | jq '.[] | [.dir.path, .auth_first, .e
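
The get subtrees pipeline above is cut off in the archive; one plausible full form (the .export_pin field is an assumption, the rest follows the subtree dump format) would be:

    # show each subtree's path, authoritative rank and pin from the mds.b admin socket
    ceph daemon mds.b get subtrees | jq '.[] | [.dir.path, .auth_first, .export_pin]'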

[ceph-users] Re: MDS_TRIM 1 MDSs behind on trimming and

2021-04-21 Thread Flemming Frandsen
No, I don't. I guess I could pin a large part of the tree, if that's something that's likely to help. On Wed, 21 Apr 2021 at 21:02, Dan van der Ster wrote: > You don't pin subtrees? > I would guess that something in the workload changed and it's triggering a > particularly bad behavior in the
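
If pinning is attempted, it is set as an extended attribute on a directory from a client mount; the path and rank here are only examples:

    # pin /mnt/cephfs/apps and everything below it to MDS rank 0
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/apps
    # a value of -1 removes the pin and hands the subtree back to the balancer
    setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/apps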

[ceph-users] Re: MDS_TRIM 1 MDSs behind on trimming and

2021-04-21 Thread Dan van der Ster
You don't pin subtrees? I would guess that something in the workload changed and it's triggering a particularly bad behavior in the md balancer. Increase debug_mds gradually on both mds's; hopefully that gives a hint as to what it's doing. .. dan On Wed, Apr 21, 2021, 8:48 PM Flemming Frandsen
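
Raising debug_mds can be done at runtime and rolled back afterwards; the daemon names and levels are illustrative only:

    # start modestly on both active MDSs, raise further if nothing useful appears
    ceph tell mds.a config set debug_mds 5
    ceph tell mds.b config set debug_mds 5
    # revert to the default when done
    ceph tell mds.a config set debug_mds 1/5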

[ceph-users] Re: MDS_TRIM 1 MDSs behind on trimming and

2021-04-21 Thread Flemming Frandsen
Not as of yet, it's steadily getting further behind. We're now up to 6797 segments and there are still the same 14 long-running operations that are all "cleaned up request". Something is blocking trimming; normally I'd follow the advice of restarting the mds: https://docs.ceph.com/en/latest/cephfs/
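
Both the trimming backlog and the stuck operations can be watched from the admin socket; a sketch, assuming the daemon is mds.a (counter names may differ slightly between releases):

    # events and segments still in the journal
    ceph daemon mds.a perf dump mds_log
    # the long-running operations that refuse to go away
    ceph daemon mds.a dump_ops_in_flight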

[ceph-users] Re: MDS_TRIM 1 MDSs behind on trimming and

2021-04-21 Thread Dan van der Ster
Did this eventually clear? We had something like this happen once when we changed an md export pin for a very top level directory from mds.3 to mds.0. This triggered so much subtree export work that it took something like 30 minutes to complete. In our case the md segments kept growing into a few 1

[ceph-users] Re: MDS_TRIM 1 MDSs behind on trimming and

2021-04-21 Thread Flemming Frandsen
I've gone through the clients mentioned by the ops in flight and none of them are connected any more. The number of segments that the MDS is behind on is rising steadily and the ops_in_flight remain; this feels a lot like a catastrophe brewing. The documentation suggests trying to restart the MDS
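
One way to cross-check the stale clients is to compare what the blocked ops reference with the sessions the MDS still holds; the daemon name and jq paths below are assumptions about the dump format:

    # clients referenced by the blocked operations
    ceph daemon mds.a dump_ops_in_flight | jq '.ops[].type_data.client_info.client'
    # sessions the MDS currently has open
    ceph daemon mds.a session ls | jq '.[].id'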