Hello,

I am trying to get a better understanding of how Ceph handles changes to CRUSH rules, such as changing the failure domain. I performed this (maybe somewhat academic, sorry) exercise, and would love to verify my conclusions (or get a good explanation of my observations):

Starting point:
  - 2 Nodes
  - 6 OSDs each
  - osd_crush_chooseleaf_type = 0 (OSD)
  - pool size = 3
  - pool min size = 2
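
(For completeness: size and min_size were set with the usual `ceph osd pool set <pool> size 3` / `ceph osd pool set <pool> min_size 2`, and osd_crush_chooseleaf_type = 0 was in ceph.conf before bootstrapping, since, as far as I understand, that option only influences the default CRUSH rule generated at cluster creation.)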

As expected, the default CRUSH rule would distribute each PG across three random OSDs, without regard for the host they are on.
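
For reference, with osd_crush_chooseleaf_type = 0 the default rule in the decompiled CRUSH map (ceph osd getcrushmap -o cm; crushtool -d cm -o cm.txt) looks roughly like this (reconstructed from memory, name/id may differ):

rule replicated_rule {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type osd
        step emit
}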

I then added the following CRUSH rule (initially unused):

rule replicated_three_over_two {
        id 1
        type replicated
        # start at the default root
        step take default
        # firstn 0 = "as many as the pool size", capped at what exists, so: both hosts
        step choose firstn 0 type host
        # pick up to 2 OSDs under each of those hosts
        step chooseleaf firstn 2 type osd
        # with size 3 only the first three of the (up to) four candidates are used,
        # i.e. two OSDs on one host and one on the other
        step emit
}

Tests with `crushtool` gave me confidence that it worked as I expected: unlike the default rule, this one always made sure at least one copy was on a different host.
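
In case it matters for judging the methodology, the test was roughly this (file names are just placeholders), followed by eyeballing the emitted OSD sets:

ceph osd getcrushmap -o cm
crushtool -d cm -o cm.txt
# add the rule above to cm.txt
crushtool -c cm.txt -o cm.new
crushtool -i cm.new --test --rule 1 --num-rep 3 --show-mappings
crushtool -i cm.new --test --rule 1 --num-rep 3 --show-bad-mappings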

When I applied this rule to an existing pool, I briefly got a HEALTH_WARN "Reduced data availability: 6 pgs peering", with the affected PGs shown as inactive (the cluster then proceeded to heal itself). This sort of surprised me; I have read threads like [1], where it seems even larger remappings go down without a hitch. I then changed back to the default rule, this time with `osd set norebalance` set, but the outcome was the same (brief reduced availability, then healing).
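
Concretely, the two switches were essentially just (pool name omitted; "replicated_rule" being the stock default rule):

ceph osd pool set <pool> crush_rule replicated_three_over_two
# -> HEALTH_WARN "pgs peering" for a moment, then recovery

ceph osd set norebalance
ceph osd pool set <pool> crush_rule replicated_rule
ceph osd unset norebalance
# -> same brief peering, despite norebalance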

Having read more about peering, I now assume that a very brief interruption due to peering is likely unavoidable?

But, more importantly, the length of the interruption probably does not scale with the amount of data in the pool (I don't have much data in there right now). So changing the rule for a big pool would still only mean a brief interruption, is that correct?

Trying to wrap my head around what exactly the lower-level implications of changing the CRUSH rule could be, I remembered some forum thread I had read, but initially dismissed as somewhat incoherent. Someone claimed that one should set the pool's min_size to 1 when changing rules. For science, I gave this a try. Interestingly enough, the warning about reduced data availability didn't happen (even though there were again inactive/peering PGs).
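
For the record, that experiment was simply (again, pool name omitted):

ceph osd pool set <pool> min_size 1
ceph osd pool set <pool> crush_rule replicated_three_over_two
# wait for HEALTH_OK, then put min_size back
ceph osd pool set <pool> min_size 2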

I arrived at the following mental model of what happens on a CRUSH rule update: the primary OSD stays put, but the placement of the two replicas is re-calculated. If they need to move, peering has to happen, which briefly blocks I/O. But with min_size = 1, the primary, I don't know, says YOLO and accepts writes without having finished peering? If so, why would the PG be considered inactive, though?

And, out of curiosity: besides the obvious risk of losing the only up-to-date copy of an object, are there any other implications of setting min_size to 1 before changing a rule? Could a write at just the wrong moment cause the peering process to fail permanently, or something like that?

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UUXT4EBPZVO5T6WY3OS6NAKHRXDGQBSD/

Thanks in advance for any enlightenment :)
Conrad