Hello,

We're running Ceph on Kubernetes 1.12 using the Rook operator (
https://rook.io), but we've been struggling to scale applications that
mount CephFS volumes beyond 600 pods / 300 nodes. Every mount uses the
kernel CephFS client, and the nodes run kernel `4.19.23-coreos-r1`.
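
For reference, a quick way to confirm that a node is using the kernel
client rather than ceph-fuse (this check is generic, not specific to our
setup):

    # kernel CephFS mounts report filesystem type "ceph" in /proc/mounts;
    # a ceph-fuse mount would show a FUSE-based type instead
    uname -r
    awk '$3 == "ceph"' /proc/mounts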

We've tried increasing the MDS memory limits, running multiple active MDS
pods, and running different Ceph versions (up to the latest Luminous and
Mimic releases), but we hit MDS_SLOW_REQUEST health warnings at the same
scale regardless of the memory limits we set. This GitHub issue has more
detail on what we've tried so far:
https://github.com/rook/rook/issues/2590
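
For concreteness, this is roughly the kind of tuning we mean. The values
below are illustrative rather than our exact settings, and the commands
assume the Mimic-style centralized config (on Luminous we set the
equivalent options in ceph.conf instead):

    # raise the MDS cache memory limit (example value: 8 GiB)
    ceph config set mds mds_cache_memory_limit 8589934592

    # allow two active MDS ranks on the filesystem
    ceph fs set myfs max_mds 2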

I've written a simple load test that reads all the files in a given
directory on an interval. While running it, I've noticed that the
`mds_co.bytes` value (from `ceph daemon mds.myfs-a dump_mempools | jq -c
'.mempool.by_pool.mds_co'`) keeps increasing on every iteration, not just
the first. Why does this number keep growing after the first pass? If the
same client is re-reading the same cached files, why would the contents
of the cache change at all? What is `mds_co.bytes` actually reporting?
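
In case it helps to reproduce, the sketch below approximates what the
load test does (the mount point and interval are placeholders, not the
real values), with the mempool stats watched from another terminal:

    # re-read every file under a CephFS-backed directory on an interval
    while true; do
      find /mnt/cephfs/testdir -type f -exec cat {} + > /dev/null
      sleep 60
    done

    # meanwhile, watch the MDS mempool between iterations
    watch -n 10 "ceph daemon mds.myfs-a dump_mempools | jq -c '.mempool.by_pool.mds_co'"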

My most important question is this: how do I configure Ceph so that it
can scale to a large number of CephFS clients?

Thanks,
Zack