Long story short, we've got a lot of empty directories that I'm working on 
removing.  While removing them, "perf top -g" shows the MDS daemon pegged at 
100% CPU in "SnapRealm::split_at" and "CInode::is_ancestor_of".

It's this two-year-old bug, which is still around:
https://tracker.ceph.com/issues/53192

To help combat this, we've moved our snapshot schedule down the tree one level 
so the snaprealm is significantly smaller.  Our luck with multiple active MDSs 
hasn't been great, so we are still on a single MDS.  To help split the load, 
I'm working on moving different workloads to different filesystems within Ceph.
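
For reference, moving the schedule down was just a matter of re-pointing the 
snap-schedule module at the subdirectories, something along these lines 
(paths and retention here are made up):

    ceph fs snap-schedule remove /bigtree
    ceph fs snap-schedule add /bigtree/projectA 1d
    ceph fs snap-schedule add /bigtree/projectB 1d

Each snapshotted directory gets its own snaprealm, so each realm covers a 
much smaller subtree.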

A user can still fairly easily overwhelm the MDS's finisher thread and 
basically stop all CephFS I/O through that MDS.  I'm hoping we can get some 
other people chiming in with "Me Too!" so there can be some traction behind 
fixing this.
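
Pacing the removals instead of blasting them through helps on our side.  A 
rough sketch of the idea (path and sleep interval are made up; re-run it for 
parents that only become empty after a pass):

    find /ceph/scratch -depth -type d -empty -print0 |
    while IFS= read -r -d '' d; do
        rmdir "$d"
        sleep 0.05   # crude pacing so the MDS finisher thread can keep up
    done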

It's a longstanding bug, so the version is less important, but we are on 17.2.7.

Thoughts?
-paul

--

Paul Mezzanini
Platform Engineer III
Research Computing

Rochester Institute of Technology

 “End users is a description, not a goal.”


