Hi,
have you resolved this issue in the meantime? If not, what is your
mds_cache_memory_limit? Increasing that, and maybe also
mds_log_max_segments, could help.
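For example, something along these lines (the values below are only
illustrative, adjust them to your hardware and workload):

  # check the current values
  ceph config get mds mds_cache_memory_limit
  ceph config get mds mds_log_max_segments

  # raise them, e.g. to 8 GiB cache and 256 log segments (example values)
  ceph config set mds mds_cache_memory_limit 8589934592
  ceph config set mds mds_log_max_segments 256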
Anything in ceph tell mds.{MDS} dump_blocked_ops?
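With the MDS name from your health warning that would be something like:

  ceph tell mds.arm-vol.k02r04nvm01.zaqebs dump_blocked_ops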
Regards,
Eugen
Quoting Adam Prycki <apry...@man.poznan.pl>:
Hello,
we are having issues with our CephFS cluster.
Any help would be appreciated.
We are still running 18.2.0.
During the holidays we had an outage caused by the rootfs filling up. OSDs
started dying randomly, and for a while not all PGs were active.
That issue has since been resolved and all OSDs work fine, but we are
stuck with some MDS issues.
The warnings we are concerned about:
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.arm-vol.k02r04nvm01.zaqebs(mds.0): 29 slow metadata IOs are
blocked > 30 secs, oldest blocked for 1899 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.arm-vol.k02r04nvm01.zaqebs(mds.0): Behind on trimming
(4851/128) max_segments: 128, num_segments: 4851
1. Our MDSs are not trimming.
2. Our active MDS has slow metadata ops which we cannot explain.
CephFS status looks OK, the main MDS is active.
All metadata pool PGs are active and working, and there are no laggy PGs.
Trying to dump in-flight ops from the MDS also doesn't help:
ceph daemon ./ceph-mds.arm-vol.k02r04nvm01.zaqebs.asok dump_ops_in_flight
{
    "ops": [],
    "num_ops": 0
}
MDS failover or an MDS restart doesn't help either.
The slow metadata ops always return after an MDS restart (all MDSs have
this issue).
After a failover, the main MDS is stuck in the rejoin state for a long time.
We've used the mds_wipe_sessions config option to bring it back into the
active state quickly.
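(Roughly like this; the exact invocation we used may have differed:

  # temporarily let the MDS drop old client sessions on startup
  ceph config set mds mds_wipe_sessions true
  # ... restart / fail over the MDS ...
  ceph config set mds mds_wipe_sessions false
)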
I'm guessing the slow metadata ops are preventing the MDS from trimming,
but we cannot figure out what is causing them.
Best regards
Adam Prycki
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io