Hello ceph-users. I'm operating a moderately large Ceph cluster with
CephFS. We currently have 288 OSDs, all on 10TB drives, and are getting
ready to migrate another 432 drives into the cluster (I'll have more
questions on that later). Our workload is highly distributed
(containerized clients running across 32 hosts, totaling in excess of
30k "clients"). We're running six active MDS daemons, each with about
32GB of cache. Metadata is stored with replica 3; actual data is stored
with replica 2 (not sure if this matters for this discussion).
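
For concreteness, the relevant settings look roughly like this on our
side (pool names here are illustrative, not necessarily what ours are
actually called):

ceph config set mds mds_cache_memory_limit 34359738368  # ~32GB per MDS
ceph osd pool get cephfs_metadata size                  # 3 (metadata)
ceph osd pool get cephfs_data size                      # 2 (data)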

Generally speaking, performance is pretty OK. We realize we're
compromising on memory for the MDS cache, and we're definitely
compromising on available memory for the OSDs; we're intentionally
limiting them a bit while we wait on additional resources.

While examining performance under load at scale, I noticed a marked
performance improvement whenever I restarted certain MDS daemons. I was
able to reproduce the improvement by issuing a "daemon mds.blah cache
drop". The performance bump lasts for quite a long time, far longer
than it takes for the cache to "fill" according to the stats.
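
For reference, I've been doing this via the admin socket and then
watching the cache refill afterwards, roughly like this ("mds.blah" is
a placeholder for one of our daemon names):

ceph daemon mds.blah cache drop
ceph daemon mds.blah cache status       # cache memory usage
ceph daemon mds.blah perf dump mds_mem  # inode/dentry counts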

I've run across a couple of settings, but can't find much documentation
on them. I'm wondering if someone can explain why I might see that
bump, and what I might be able to tune to increase performance there.

For reference, the majority of client accesses are into large directories
of the form:

/root/files/hash[0:2]/hash[0:4]/hash
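
Concretely, with a made-up hex digest, that works out to something like
this (sketched in bash):

hash=9f86d081884c7d659a2feaa0c55ad015
echo /root/files/${hash:0:2}/${hash:0:4}/${hash}
# -> /root/files/9f/9f86/9f86d081884c7d659a2feaa0c55ad015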

I realize that this impacts locking and fragmentation. I am hoping
someone can help decipher some of the MDS config options so that I can
see where making some changes might help.

Additionally, I noted a few client options that raised some questions.
First, "client use random mds": according to the Mimic docs, this is
"false" by default. If that is the case, how does a client choose an
MDS to communicate with? On top of that, does it stick with that MDS
forever? When I look at our MDS daemons, they all list connections from
each client. We're using the FUSE client after having serious issues
with the kernel driver on RHEL 7 (the mount would go stale for unknown
reasons, and a full system reboot was required to clear the held
capabilities from the MDS cluster and recover the mount on the affected
system).
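
For reference, when I say the daemons all list connections from each
client, I'm looking at the per-daemon session list, along the lines of:

ceph daemon mds.blah session ls  # client sessions (and cap counts) held by that MDS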

There are also "client caps release delay," which seems like it might help
us if we were to decrease that number so a client wouldn't necessarily hold
on to a directory for as long as it might by default. There are a few cache
options, too, that I want to understand.
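
On the caps release delay specifically, what I had in mind was
something like this (the value is just a guess at "shorter than the
default"), either in ceph.conf under [client] or via the config
database:

ceph config set client client_caps_release_delay 2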

I know this is long, but hopefully it makes sense and someone can give me a
few pointers. If you need additional information to comment, please feel
free to ask!

--
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC