Hello,
I am trying to get a better understanding of how Ceph handles changes to
CRUSH rules, such as changing the failure domain. I performed this
(maybe somewhat academic, sorry) exercise, and would love to verify my
conclusions (or get a good explanation of my observations):
Starting point:
- 2 Nodes
- 6 OSDs each
- osd_crush_chooseleaf_type = 0 (OSD)
- pool size = 3
- pool min size = 2
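For completeness, the pool was set up along these lines ("testpool" and
the PG count are just placeholders):

    # placeholder pool name and PG count, settings as listed above
    ceph osd pool create testpool 32 32 replicated
    ceph osd pool set testpool size 3
    ceph osd pool set testpool min_size 2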
As expected, the default CRUSH rule would distribute each PG across
three random OSDs, without regard for the host they are on.
I then added the following CRUSH rule (initially unused):
rule replicated_three_over_two {
    id 1
    type replicated
    step take default
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}
Tests with `crushtool` gave me confidence that it worked as I expected:
unlike the default rule, this one always made sure at least one copy was
on a different host.
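In case the details matter, the test was roughly the usual crushtool
round trip (file names arbitrary):

    # export and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # add the rule above to crushmap.txt, recompile, and dry-run it
    crushtool -c crushmap.txt -o crushmap-new.bin
    crushtool -i crushmap-new.bin --test --rule 1 --num-rep 3 --show-mappings
    # only then inject it into the cluster (still unused by any pool)
    ceph osd setcrushmap -i crushmap-new.bin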
When I applied this rule to an existing pool, I briefly got a
HEALTH_WARN "Reduced data availability: 6 pgs peering", with the
affected PGs shown as inactive (the cluster then proceeded to heal
itself). This somewhat surprised me, since I have read threads like
[1] where even much larger remappings seem to go down without a hitch.
I then changed back to the default rule, this time with `osd set
norebalance` set, but the outcome was the same: brief reduced
availability, then healed.
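For reference, the switch itself was nothing fancier than this
("testpool" again a placeholder, "replicated_rule" the stock default):

    # assign the new rule to the existing pool
    ceph osd pool set testpool crush_rule replicated_three_over_two
    # and for the way back, with rebalancing temporarily disabled:
    ceph osd set norebalance
    ceph osd pool set testpool crush_rule replicated_rule
    ceph osd unset norebalance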
Having read more about peering, I now assume that a very brief
interruption due to peering is likely unavoidable?
But, more importantly, I assume the length of the interruption does not
scale with the amount of data in the pool (I don't have much data in
there right now). So changing the rule on a big pool would still mean
only a brief interruption, is that correct?
Trying to wrap my head around what exactly the lower-level implications
of changing a CRUSH rule could be, I remembered a forum thread I had
read but initially dismissed as somewhat incoherent: someone claimed
that one should set the pool's min_size to 1 when changing rules. For
science, I gave this a try. Interestingly enough, the warning about
reduced data availability did not appear this time (even though there
were again inactive/peering PGs).
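Concretely, that experiment was just this (placeholder pool name
again), with `ceph -s` running in another terminal:

    # drop min_size to 1 before switching rules
    ceph osd pool set testpool min_size 1
    ceph osd pool set testpool crush_rule replicated_three_over_two
    # once everything is active+clean again, restore the default
    ceph osd pool set testpool min_size 2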
I arrived at the following mental model of what happens on a CRUSH rule
update: the primary OSD stays put, but the placement of the two
replicas is recalculated. If they need to be moved, peering has to
happen, which briefly blocks I/O. But with min_size = 1, the primary,
I don't know, says YOLO and accepts writes without having finished
peering? If so, why would the PG still be considered inactive, though?
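I suppose one way to sanity-check that model would be to compare a
PG's up/acting sets (and its peering state) before and after the
switch, something like this ("2.0" being just an example PG id):

    # where does this PG map to right now?
    ceph pg map 2.0
    # full detail, including state and past intervals
    ceph pg 2.0 query
    # or all PGs at once, with their up/acting sets
    ceph pg dump pgs_brief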
And, out of curiosity: besides the obvious risk of losing your only
up-to-date copy of an object, are there any other implications of
setting min_size to 1 before changing a rule? Like, could a write at
just the wrong moment cause the peering process to permanently fail,
or something like that?
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UUXT4EBPZVO5T6WY3OS6NAKHRXDGQBSD/
Thanks in advance for any enlightenment :)
Conrad