On Fri, Dec 13, 2024 at 7:09 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> Dear all,
>
> I have a question about random ephemeral pinning that I can't find an 
> answer to in the docs. Question first, some background later. I checked 
> the docs for every version from octopus up to latest; the version we 
> would apply random ephemeral pinning on is pacific. What I would like to 
> configure on a subtree is this:
>
> Enable random ephemeral pinning at the root of the tree, say, /cephfs/root:
>
>    setfattr -n ceph.dir.pin.random -v 0.0001 /cephfs/root
>
> Will this have the following effects:
>
> A) The root of the tree /cephfs/root is ephemerally pinned to a rank 
> according to a consistent hash of its inode number.

No.

> B) Any descendant sub-directory may be ephemerally pinned 0.01 percent of the 
> time to a rank according to a consistent hash of its inode number.

Yes.

> The important difference from the docs is point A. I don't want *any* 
> subdir under the root /cephfs/root to be *not pinned* to an MDS. The docs 
> only talk about descendant sub-dirs, but the root is important here too, 
> because if it is not pinned it will create a large number of unpinned 
> dirfrags that float around, causing exactly the expensive exportdir 
> operations that pinning is there to avoid in the first place.
>
> My questions are:
>
> 1) What does random ephemeral pinning do to the sub-tree root? Is it pinned 
> or not?

Not.

> 2) If it doesn't pin the root, does this work as intended or will it pin 
> everything to rank 1:
>
>    setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/root
>    setfattr -n ceph.dir.pin -v 1 /cephfs/root

That won't work currently, but it probably should. I think, as an
intermediate solution, you could set the export pin on a parent of
"/cephfs/root".

> Background: We use cephfs as a home file system for an HPC cluster and are 
> in exactly the situation of the example for distributed ephemeral pinning 
> (https://docs.ceph.com/en/latest/cephfs/multimds/?highlight=ephemeral+pin#setting-subtree-partitioning-policies)
> *except* that the home-dirs of users differ dramatically in size.
>
> We were not pinning at first, and this led to our MDSes going crazy due to 
> the load balancer moving dirfrags around all the time. This "load 
> balancing" was itself responsible for 90-95% (!!!) of the total MDS load. 
> After moving to octopus we simulated distributed ephemeral pinning with a 
> cron job that assigned home-dirs in round-robin fashion to the MDS ranks 
> that had the fewest pins. This immediately calmed down our entire MDS 
> cluster (8 active and 4 stand-by) and user experience improved 
> dramatically. MDS load dropped from 125-150% (idle load!!) to about 20-25% 
> per MDS, and memory usage stabilized as well.
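>
> In essence, the cron job did something like the following (a simplified 
> sketch; the paths and rank count are illustrative, and the real script 
> also took the existing pin counts per rank into account):
>
>    ranks=8   # number of active MDS ranks
>    i=0
>    for d in /cephfs/home/*/; do
>        setfattr -n ceph.dir.pin -v $(( i % ranks )) "$d"
>        i=$(( i + 1 ))
>    done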
>
> The easy way forward would be to replace our manual distribution with 
> distributed ephemeral pinning of /home (in octopus this was experimental; 
> after our recent upgrade to pacific we can use the built-in distribution). 
> However, as stated above, home-dirs differ in size to such a degree that 
> chunking the file system up into equally-sized sub-dir trees would be 
> better than distributing entire home-dir trees over ranks. Users with very 
> large sub-trees might get spread out over more than one rank.
>
> This is what random ephemeral pinning seems to be there for, and I would 
> like to chunk our entire filesystem up into sub-trees of 10000-100000 
> directory fragments each and distribute these over the MDSes. However, 
> this only works if the root, and with it the first sub-tree, is also 
> pinned. Note that this is not a problem with distributed ephemeral 
> pinning, because that policy pins *all* *immediate* children of the pin 
> root and, therefore, does not create free-floating directory fragments.
>
> I would be grateful if someone could shed light on whether or not the pin 
> root of random ephemeral pinning is itself pinned.

You could do both distributed and random:

setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home
setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/home/*

You'd need to set the random pin whenever a new user directory is
created, but that's probably acceptable? The advantage is that you'd
get a "pretty good" default distribution across ranks, and then for
really large user directories it would split as you would expect.
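
A sketch of how that could be automated from cron (paths illustrative;
re-applying the xattr to directories that already have it is harmless,
so no bookkeeping is needed):

for d in /cephfs/home/*/; do
    setfattr -n ceph.dir.pin.random -v 0.001 "$d"
done

You can inspect the resulting subtree pins afterwards with "ceph tell
mds.<fs_name>:0 get subtrees".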

Thanks for sharing your use-case.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D