Hi Patrick,

thanks for your answers. We can't pin the directory above /cephfs/root, as it
is the root of the ceph-fs itself, which doesn't accept any pinning.

Following your explanation and the docs, I'm also not sure what the
original/intended use-case for random pinning is. To me it makes no sense to
have some parts pinned while a potentially very large part at the pin root
remains unpinned: with a pinning probability of 0.01 and a depth-first walk,
we are talking about an initial unpinned run of roughly 100 directories on
average along any descent path. Take a full binary tree as a file system and
that is a potentially huge unpinned sub-tree hanging off the pin root.
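Just to make that arithmetic concrete, here is a rough bash simulation
(purely illustrative, nothing Ceph-specific; p = 0.01 as in the example
above). Each directory on a descent path is treated as pinned independently
with probability p, and we count how many unpinned directories a walk passes
before hitting the first pin:

  # expected length of the unpinned run is (1-p)/p ~= 99 for p = 0.01
  p_inv=100; trials=1000; total=0
  for ((i = 0; i < trials; i++)); do
      depth=0
      while (( RANDOM % p_inv != 0 )); do   # pinned with probability 1/p_inv
          depth=$((depth + 1))
      done
      total=$((total + depth))
  done
  echo "mean unpinned run over $trials walks: $((total / trials))"

The mean comes out close to 100, which is exactly the unpinned region below
the pin root that worries us.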
Setting up ephemeral pinning with something like

  setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home
  setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/home/*

will work for us. Are there any stats on the size/depth of sub-trees pinned
to the same rank under random ephemeral pinning with access patterns like
"depth-first walk", "breadth-first walk", and "random-leaf walk"? Similarly,
are there any stats on the long-term equilibrium sizes of sub-trees pinned to
the same rank under constant random access load? Any information that would
help us choose a reasonable probability value for our home-dir sizes would be
welcome. The practical result of random pinning is rather unintuitive and it
would be great to have some examples with stats.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonn...@redhat.com>
Sent: Wednesday, December 18, 2024 4:52 AM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Random ephemeral pinning, what happens to sub-tree under pin root dir

On Fri, Dec 13, 2024 at 7:09 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> Dear all,
>
> I have a question about random ephemeral pinning that I can't find an
> answer to in the docs. Question first and some background later. Docs
> checked for any version from octopus up to latest. Our version for
> applying random ephemeral pinning is pacific. What I would like to
> configure on a subtree is this:
>
> Enable random ephemeral pinning at the root of the tree, say, /cephfs/root:
>
> setfattr -n ceph.dir.pin.random -v 0.0001 /cephfs/root
>
> Will this have the following effects:
>
> A) The root of the tree /cephfs/root is ephemerally pinned to a rank
> according to a consistent hash of its inode number.

No.

> B) Any descendant sub-directory may be ephemerally pinned 0.01 percent of
> the time to a rank according to a consistent hash of its inode number.

Yes.

> The important difference to the docs is point A. I don't want to have
> *any* subdir under the root /cephfs/root *not pinned* to an MDS. The docs
> only talk about descendant sub-dirs, but the root is important here too,
> because if it is not pinned it will create a large number of unpinned
> dirfrags that float around with expensive exportdir operations that
> pinning is there to avoid in the first place.
>
> My questions are:
>
> 1) What does random ephemeral pinning do to the sub-tree root? Is it
> pinned or not?

Not.

> 2) If it doesn't pin the root, does this work as intended or will it pin
> everything to rank 1:
>
> setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/root
> setfattr -n ceph.dir.pin -v 1 /cephfs/root

That won't work currently but it probably should. I think as an
intermediate solution you could set the export pin on a parent of
"/cephfs/root".

> Background: We use cephfs as a home file system for an HPC cluster and are
> exactly in the situation of the example for distributed ephemeral pinning
> (https://docs.ceph.com/en/latest/cephfs/multimds/?highlight=ephemeral+pin#setting-subtree-partitioning-policies)
> *except* that the home-dirs of users differ dramatically in size.
>
> We were not pinning at first and this led to our MDSes going crazy due to
> the load balancer moving dirfrags around all the time. This "load
> balancing" was itself responsible for 90-95% (!!!) of the total MDS load.
> After moving to octopus we simulated distributed ephemeral pinning with a
> cron job that assigned home-dirs in a round-robin fashion to the MDS ranks
> with the fewest pins. This immediately calmed down our entire MDS cluster
> (8 active and 4 stand-by) and user experience improved dramatically. MDS
> load dropped from 125-150% (idle load!!) to about 20-25% per MDS and
> memory usage stabilized as well.
>
> The easy way forward would be to replace our manual distribution with
> distributed ephemeral pinning of /home (in octopus this was experimental;
> after our recent upgrade to pacific we can use the built-in distribution).
> However, as stated above, the size of home-dirs differs to such a degree
> that chunking the file system up into equally-sized sub-dir trees would be
> better than distributing entire home-dir trees over ranks. Users with very
> large sub-trees might get spread out over more than one rank.
>
> This is what random ephemeral pinning seems to be there for, and I would
> like to chunk our entire filesystem up into sub-trees of 10000-100000
> directory fragments and distribute these over the MDSes. However, this
> only works if the root, and with it the first sub-tree, is also pinned.
> Note that this is not a problem with distributed ephemeral pinning,
> because that policy pins *all* *immediate* children of the pin root and,
> therefore, does not create free-floating directory fragments.
>
> I would be grateful if someone could shed light on whether or not the pin
> root of random ephemeral pinning is itself pinned.

You could do both distributed and random:

setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home
setfattr -n ceph.dir.pin.random -v 0.001 /cephfs/home/*

You'd need to set the random pin whenever a new user directory is created,
but that's probably acceptable? The advantage is that you'd get a default
"pretty good" distribution across ranks and then for really large user
directories it would split as you would expect.

Thanks for sharing your use-case.

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
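P.S.: Regarding "set the random pin whenever a new user directory is
created": a minimal sketch of what we would probably run from cron (or from
the user-provisioning script), assuming that re-applying the same xattr
value to a directory that already has it is a harmless no-op; the paths and
the 0.001 value are just the ones from this thread:

  #!/bin/sh
  # sketch only: (re-)apply the random ephemeral pin to every home directory
  for d in /cephfs/home/*/; do
      setfattr -n ceph.dir.pin.random -v 0.001 "$d"
  done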