Hello,

I am trying to get a better understanding of how Ceph handles changes to CRUSH rules, such as changing the failure domain. I performed this (maybe somewhat academic, sorry) exercise, and would love to verify my conclusions (or get a good explanation of my observations):

Starting point:
  - 2 Nodes
  - 6 OSDs each
  - osd_crush_chooseleaf_type = 0 (OSD)
  - pool size = 3
  - pool min size = 2
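
(For completeness: size and min_size were set with the usual `ceph osd pool set <pool> size 3` / `ceph osd pool set <pool> min_size 2`, and osd_crush_chooseleaf_type = 0 was in ceph.conf before bootstrapping, since, as far as I understand, that option only influences the default CRUSH rule generated at cluster creation.)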

As expected, the default CRUSH rule would distribute each PG across three random OSDs, without regard for the host they are on.
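
For reference, with osd_crush_chooseleaf_type = 0 the default rule in the decompiled CRUSH map (ceph osd getcrushmap -o cm; crushtool -d cm -o cm.txt) looks roughly like this (reconstructed from memory, name/id may differ):

rule replicated_rule {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type osd
        step emit
}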

I then added the following CRUSH rule (initially unused):

rule replicated_three_over_two {
        id 1
        type replicated
        # start at the default root
        step take default
        # firstn 0 = "as many as the pool size", capped at what exists, so: both hosts
        step choose firstn 0 type host
        # pick up to 2 OSDs under each of those hosts
        step chooseleaf firstn 2 type osd
        # with size 3 only the first three of the (up to) four candidates are used,
        # i.e. two OSDs on one host and one on the other
        step emit
}

Tests with `crushtool` gave me confidence that it worked as I expected: unlike the default rule, this one always made sure at least one copy was on a different host.
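
In case it matters for judging the methodology, the test was roughly this (file names are just placeholders), followed by eyeballing the emitted OSD sets:

ceph osd getcrushmap -o cm
crushtool -d cm -o cm.txt
# add the rule above to cm.txt
crushtool -c cm.txt -o cm.new
crushtool -i cm.new --test --rule 1 --num-rep 3 --show-mappings
crushtool -i cm.new --test --rule 1 --num-rep 3 --show-bad-mappings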

When I applied this rule to an existing pool, I briefly got a HEALTH_WARN "Reduced data availability: 6 pgs peering", with the affected PGs shown as inactive (the cluster then proceeded to heal itself). This sort of surprised me; I have read threads like [1], where it seems even larger remappings go down without a hitch. I then changed back to the default rule, this time with `osd set norebalance` set, but the outcome was the same (brief reduced availability, then healing).
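
Concretely, the two switches were essentially just (pool name omitted; "replicated_rule" being the stock default rule):

ceph osd pool set <pool> crush_rule replicated_three_over_two
# -> HEALTH_WARN "pgs peering" for a moment, then recovery

ceph osd set norebalance
ceph osd pool set <pool> crush_rule replicated_rule
ceph osd unset norebalance
# -> same brief peering, despite norebalance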

Having read more about peering, I now assume that a very brief interruption due to peering is likely unavoidable?

But, more importantly, the length of the interruption probably does not scale with the amount of data in the pool (I don't have much data in there right now). So changing the rule for a big pool would still only mean a brief interruption, is that correct?

Trying to wrap my head around what exactly the lower-level implications of changing the CRUSH rule could be, I remembered some forum thread I had read, but initially dismissed as somewhat incoherent. Someone claimed that one should set the pool's min_size to 1 when changing rules. For science, I gave this a try. Interestingly enough, the warning about reduced data availability didn't happen (even though there were again inactive/peering PGs).
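
For the record, that experiment was simply (again, pool name omitted):

ceph osd pool set <pool> min_size 1
ceph osd pool set <pool> crush_rule replicated_three_over_two
# wait for HEALTH_OK, then put min_size back
ceph osd pool set <pool> min_size 2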

I arrived at the following mental model of what happens on a CRUSH rule update: the primary OSD stays put, but the placement of the two replicas is re-calculated. If they need to move, peering has to happen, which briefly blocks I/O. But with min_size = 1, the primary, I don't know, says YOLO and accepts writes without having finished peering? If so, why would the PG be considered inactive, though?

And, out of curiosity: besides the obvious risk of losing the only up-to-date copy of an object, are there any other implications of setting min_size to 1 before changing a rule? Could a write at just the wrong moment cause the peering process to fail permanently, or something like that?

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UUXT4EBPZVO5T6WY3OS6NAKHRXDGQBSD/

Thanks in advance for any enlightenment :)
Conrad