On 30/04/2025 00:14, Conrad Hoffmann wrote:
Hello,
I am trying to get a better understanding of how Ceph handles changes
to CRUSH rules, such as changing the failure domain. I performed this
(maybe somewhat academic, sorry) exercise, and would love to verify
my conclusions (or get a good explanation of my observations):
Starting point:
- 2 Nodes
- 6 OSDs each
- osd_crush_chooseleaf_type = 0 (OSD)
- pool size = 3
- pool min size = 2
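(For reference, a minimal sketch of how these settings can be checked,
with "mypool" as a placeholder pool name:)

    # check the pool's replication settings
    ceph osd pool get mypool size
    ceph osd pool get mypool min_size
    # the OSD-level failure domain comes from ceph.conf at deploy time:
    # [global]
    # osd_crush_chooseleaf_type = 0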
As expected, the default CRUSH rule would distribute each PG across
three random OSDs, without regard for the host they are on.
I then added the following CRUSH rule (initially unused):
rule replicated_three_over_two {
    id 1
    type replicated
    step take default
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}
Tests with `crushtool` gave me confidence that it worked as I
expected: unlike the default rule, this one always made sure at least
one copy was on a different host.
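(A minimal sketch of such a test, with placeholder file names; rule id
1 matches the rule above:)

    # grab and decompile the current CRUSH map
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt to add the rule above, then recompile it
    crushtool -c crush.txt -o crush-new.bin
    # show the OSD sets rule 1 would produce for 3 replicas
    crushtool -i crush-new.bin --test --rule 1 --num-rep 3 --show-mappings
    # and check that no bad mappings are reported
    crushtool -i crush-new.bin --test --rule 1 --num-rep 3 --show-bad-mappings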
When I applied this rule to an existing pool, for a moment I got a
HEALTH_WARN "Reduced data availability: 6 pgs peering" and the PGs
being shown as inactive (cluster then proceeded to heal itself). This
sort of surprised me. I have read threads like [1], where it seems
even larger remappings go down without a hitch. I changed back to the
default rule, this time with `osd set norebalance` set. But the
outcome was the same (brief reduced availability, then healed).
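(The switch itself was nothing special; a minimal sketch, with
"mypool" as a placeholder pool name:)

    # optionally keep the cluster from rebalancing during the switch
    ceph osd set norebalance
    # point the pool at the new rule
    ceph osd pool set mypool crush_rule replicated_three_over_two
    # ...and back to the default rule for the second test
    ceph osd pool set mypool crush_rule replicated_rule
    # allow rebalancing again afterwards
    ceph osd unset norebalance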
Having read more about peering, I now assume that a very brief
interruption due to peering is likely unavoidable?
But, more importantly, the length of the interruption probably does
not scale with the size of the pool (I don't have much data in there
right now). So changing the rule for a big pool would still mean only
a brief interruption, is that correct?
Trying to wrap my head around what exactly the lower-level
implications of changing the CRUSH rule could be, I remembered some
forum thread I had read, but initially dismissed as somewhat
incoherent. Someone claimed that one should set the pool's min_size to
1 when changing rules. For science, I gave this a try. Interestingly
enough, the warning about reduced data availability didn't happen
(even though there were again inactive/peering PGs).
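(Again just the obvious knob; a minimal sketch with a placeholder pool
name:)

    # lower min_size before the rule change, then restore it afterwards
    ceph osd pool set mypool min_size 1
    ceph osd pool set mypool crush_rule replicated_three_over_two
    ceph osd pool set mypool min_size 2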
I arrived at the following mental model of what happens on a CRUSH
rule update: the lead OSD remains steady, but the placement of the two
replicas is re-calculated. In case they need to be moved, peering
needs to happen, briefly blocking. But, in case of min_size = 1, the
lead OSD, I don't know, says YOLO and accepts writes without having
finished peering? If so, why would the PG be considered inactive, though?
And, out of curiosity: besides the obvious possibility of corrupting
your only copy of an object, are there any implications to setting
min_size to 1 before changing a rule? Like, could a write in the right
moment cause the peering process to permanently fail or something?
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UUXT4EBPZVO5T6WY3OS6NAKHRXDGQBSD/
Thanks in advance for any enlightenment :)
Conrad
_______________________________________________
I would not comment on the specific corner case where you first set
osd_crush_chooseleaf_type = 0 (osd) and have only 2 nodes, as there
could be exceptions I am not aware of. But if I understand you
correctly, you are interested in the general idea of how Ceph handles
changes to CRUSH rules and how peering is involved, etc. So I will
comment on the general case (no osd_crush_chooseleaf_type changes,
cluster with 3+ nodes).
Assume you have a PG in a pool with failure domain host, mapped to
OSDs A, B, C and in the active+clean state. Then you change the rule
to failure domain rack: the monitors will issue a new epoch (a new
version of the map) which maps the PG to OSDs D, E, F.
To my understanding, once this mapping is computed, it really does not
matter how it came to be: whether from a CRUSH failure-domain change
(our case), from OSDs or hosts being added or removed, or from an
upmap override that bypasses CRUSH.
OSD D will then get this mapping and realize it is the primary for
this PG. It will look up past epochs and see that the previous epoch
had A, B, C in the active+clean state, so it will initiate peering
with A, B, C, E, F (had the previous state not been active+clean,
more history and more OSDs may be involved). D realizes that,
temporarily, it should make A a temporary primary (this may involve
another epoch change, I am not sure). The agreement (peering result)
among these OSDs will result in D, E, F being the Up set and A, B, C
being the Acting set, with A as the temporary acting primary.
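You can watch this Up/Acting split while it happens; a quick sketch,
with 1.0 as an example PG id:

    # per-PG view with UP, UP_PRIMARY, ACTING and ACTING_PRIMARY columns
    ceph pg dump pgs_brief
    # detailed state of one PG, including peering information
    ceph pg 1.0 query
    # the temporary acting-set overrides show up as pg_temp entries
    ceph osd dump | grep pg_temp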
After the peering step, the PG will be active, serving i/o using A, B,
C. A will then initiate a backfill operation to D, E, F. Once
backfilling completes, a new epoch is reached in which D becomes the
primary and D, E, F are both the Up set and the Acting set. A, B, C
will no longer be involved with the PG, and D will instruct them to
delete their copies since the PG is now active+clean.
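While the backfill runs you can follow the progress; again just a
sketch, with 1.0 as an example PG id and the letters standing in for
real OSD ids:

    # cluster-wide recovery/backfill progress
    ceph -s
    # list PGs currently backfilling or waiting to backfill
    ceph pg ls backfilling
    ceph pg ls backfill_wait
    # once done, up and acting should both reflect the new rule, e.g.
    ceph pg map 1.0
    # -> osdmap eNNN pg 1.0 (1.0) -> up [D,E,F] acting [D,E,F]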
In the general case you should not observe peering being stuck for a
long time or inactive PGs. If, however, there are incorrect custom
backfill settings that make the backfill load/traffic too high for the
hardware being used, then, given that such CRUSH changes can require a
large portion of your data to be moved, the stress can cause OSD
heartbeats and other traffic to time out, and you could see inactive
and stuck PGs. But again, this would be due to incorrect settings.
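If you do run into that, the usual approach is to throttle backfill
rather than touch min_size; for example (treat the values as
illustrative, defaults differ between releases):

    # limit concurrent backfills per OSD and slow recovery down a bit
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_recovery_sleep 0.1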
/maged
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io