I'll be damned.
I restarted the wedged mds and after a reasonable amount of time the
standby mds finished replaying and became active.
The cluster is now healthy and it seems the apps I have running on top of
cephfs have sorted themselves out too. I guess all the MDS really needed
was a stern bul
No, no, pinning now won't help anything... I was asking to understand if it's
likely there is balancing happening actively now. If you don't pin, then
it's likely.
Try the debug logs. And check the exports using something like:
ceph daemon mds.b get subtrees | jq '.[] | [.dir.path, .auth_first,
.e
No, I don't.
I guess I could pin a large part of the tree, if that's something that's
likely to help.
On Wed, 21 Apr 2021 at 21:02, Dan van der Ster wrote:
> You don't pin subtrees ?
> I would guess that something in the workload changed and it's triggering a
> particularly bad behavior in the
You don't pin subtrees ?
I would guess that something in the workload changed and it's triggering a
particularly bad behavior in the md balancer.
Increase debug_mds gradually on both mds's; hopefully that gives a hint as
to what it's doing.
.. dan
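
A minimal sketch of the "increase debug_mds gradually" advice above, assuming
the MDS admin socket is reachable (that is, the command runs on the host of
the MDS in question). The helper name set_mds_debug and the CEPH_BIN override
are illustrative assumptions, not part of Ceph; only the underlying
`ceph daemon mds.<name> config set debug_mds <level>` call is a real command.

```shell
# Assumption: CEPH_BIN exists only so the helper can be dry-run, e.g. CEPH_BIN=echo.
CEPH_BIN="${CEPH_BIN:-ceph}"

# Hypothetical helper: set debug_mds on one MDS through its admin socket.
#   $1 = MDS name without the "mds." prefix (e.g. b)
#   $2 = debug level to apply
set_mds_debug() {
    "$CEPH_BIN" daemon "mds.$1" config set debug_mds "$2"
}
```

Calling `set_mds_debug b 5`, reading the MDS log for a while, and only then
moving to a higher level keeps the log volume manageable. Note the value that
was in place beforehand and set it back once the investigation is done.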
On Wed, Apr 21, 2021, 8:48 PM Flemming Frandsen
Not as of yet, it's steadily getting further behind.
We're now up to 6797 segments and there's still the same 14 long-running
operations that are all "cleaned up request".
Something is blocking trimming. Normally I'd follow the advice of
restarting the mds:
https://docs.ceph.com/en/latest/cephfs/
Did this eventually clear?
We had something like this happen once when we changed an md export pin for
a very top level directory from mds.3 to mds.0. This triggered so much
subtree export work that it took something like 30 minutes to complete. In
our case the md segments kept growing into a few 1
I've gone through the clients mentioned by the ops in flight and none of
them are connected any more.
The number of segments that the MDS is behind on is rising steadily and the
ops_in_flight remain. This feels a lot like a catastrophe brewing.
The documentation suggests trying to restart the MDS
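
For anyone following along, a rough sketch of how the two symptoms described
above (the stuck ops in flight and the growing segment count) might be checked
from the command line. The function names and the CEPH_BIN override are
illustrative assumptions; the wrapped commands, `ceph daemon mds.<name>
dump_ops_in_flight` and `ceph health detail`, are standard Ceph commands. The
first goes through the admin socket, so it has to run on the MDS host.

```shell
# Assumption: CEPH_BIN exists only so the helpers can be dry-run, e.g. CEPH_BIN=echo.
CEPH_BIN="${CEPH_BIN:-ceph}"

# Hypothetical helper: dump the operations currently in flight on one MDS.
#   $1 = MDS name without the "mds." prefix
mds_ops_in_flight() {
    "$CEPH_BIN" daemon "mds.$1" dump_ops_in_flight
}

# Hypothetical helper: print detailed cluster health. When an MDS is behind
# on trimming, the warning text includes its current segment count.
cluster_health_detail() {
    "$CEPH_BIN" health detail
}
```

Running both at intervals shows whether the same operations stay stuck and
whether the segment count is still climbing.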