Dear all,

I have a question about random ephemeral pinning that I can't find an answer 
to in the docs. Question first, some background later. I checked the docs for 
every version from octopus up to latest; the version we will apply random 
ephemeral pinning on is pacific. What I would like to configure on a subtree 
is this:

Enable random ephemeral pinning at the root of the tree, say, /cephfs/root:

   setfattr -n ceph.dir.pin.random -v 0.0001 /cephfs/root

Will this have the following effects:

A) The root of the tree /cephfs/root is ephemerally pinned to a rank according 
to a consistent hash of its inode number.
B) Any descendant sub-directory may be ephemerally pinned 0.01 percent of the 
time to a rank according to a consistent hash of its inode number.

The important difference to the docs is point A. I don't want *any* subdir 
under the root /cephfs/root to be *not pinned* to an MDS. The docs only talk 
about descendant sub-dirs, but the root matters here too: if it is not pinned, 
it will create a large number of unpinned dirfrags that float around, 
triggering the expensive exportdir operations that pinning is there to avoid 
in the first place.
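
For completeness, this is how I would check what actually got pinned (just a 
sketch; it assumes jq is available, runs on the MDS host, and the field names 
are what I remember from pacific's "get subtrees" output, so take them with a 
grain of salt):

   ceph daemon mds.<name> get subtrees | \
     jq '.[] | [ .dir.path, .auth_first ]'

A free-floating dirfrag would show up here as a subtree whose authoritative 
rank changes over time.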

My questions are:

1) What does random ephemeral pinning do to the sub-tree root? Is it pinned or 
not?
2) If it doesn't pin the root, does the following work as intended, or will it 
pin everything to rank 1:

   setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/root
   setfattr -n ceph.dir.pin -v 1 /cephfs/root
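
Side note: I would verify that both attributes are actually set by reading 
them back; a sketch, assuming the virtual xattrs can be read back like 
ordinary ones:

   getfattr -n ceph.dir.pin /cephfs/root
   getfattr -n ceph.dir.pin.random /cephfs/root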

Background: We use cephfs as a home file system for an HPC cluster and are in 
exactly the situation of the example for distributed ephemeral pinning 
(https://docs.ceph.com/en/latest/cephfs/multimds/?highlight=ephemeral+pin#setting-subtree-partitioning-policies)
*except* that the home-dirs of users differ dramatically in size.

We were not pinning at first and this led to our MDSes going crazy due to the 
load balancer moving dirfrags around all the time. This "load balancing" was 
itself responsible for 90-95% (!!!) of the total MDS load. After moving to 
octopus we simulated distributed ephemeral pinning with a cron job that 
assigned home-dirs in round-robin fashion to the MDS ranks with the fewest 
pins. This immediately calmed down our entire MDS cluster (8 active and 4 
stand-by) and user experience improved dramatically. MDS load dropped from 
125-150% (idle load!!) to about 20-25% per MDS, and memory usage stabilized as 
well.

The easy way forward would be to replace our manual distribution with 
distributed ephemeral pinning of /home (in octopus this was experimental; 
after our recent upgrade to pacific we can use the built-in distribution). 
However, as stated above, the sizes of home-dirs differ to a degree where 
chunking the file system up into equally-sized sub-dir trees would be better 
than distributing entire home-dir trees over ranks: users with very large 
sub-trees might then get spread out over more than one rank.
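
For reference, the built-in replacement for our cron job would be a one-liner 
on the parent of the home dirs, as documented for pacific (assuming they live 
under /cephfs/home on our mounts):

   setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home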

This is what random ephemeral pinning seems to be there for, and I would like 
to chunk our entire filesystem up into sub-trees of 10000-100000 directory 
fragments and distribute these over the MDSes. However, this only works if the 
root, and with it the first sub-tree, is also pinned. Note that this is not a 
problem with distributed ephemeral pinning, because that policy pins *all* 
*immediate* children of the pin root and therefore does not create 
free-floating directory fragments.
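
My back-of-the-envelope arithmetic for the probability value, in case someone 
spots an error in my reasoning: if every descendant directory is pinned 
independently with probability p, the expected number of directories per 
pinned chunk should be roughly 1/p:

   # my assumption, not from the docs: expected chunk size ~ 1/p
   #   p = 0.0001  -> chunks of ~ 10000 directories
   #   p = 0.00001 -> chunks of ~100000 directories
   setfattr -n ceph.dir.pin.random -v 0.0001 /cephfs/root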

I would be grateful if someone could shed light on whether the pin root of 
random ephemeral pinning is itself pinned or not.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
