On 30/04/2025 00:14, Conrad Hoffmann wrote:

Hello,

I am trying to get a better understanding of how Ceph handles changes to CRUSH rules, such as changing the failure domain. I performed this (maybe somewhat academic, sorry) exercise, and would love to verify my conclusions (or get a good explanation of my observations):

Starting point:
  - 2 Nodes
  - 6 OSDs each
  - osd_crush_chooseleaf_type = 0 (OSD)
  - pool size = 3
  - pool min size = 2

As expected, the default CRUSH rule would distribute each PG across three random OSDs, without regard for the host they are on.
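
For reference, this is easy to check by listing the PG mappings; something like the following should show the up/acting OSD sets per PG (the pool name "testpool" and the PG id "1.0" are just placeholders):

  # list all PGs of the pool together with their up/acting OSD sets
  ceph pg ls-by-pool testpool
  # or look at a single PG's mapping
  ceph pg map 1.0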

I then added the following CRUSH rule (initially unused):

rule replicated_three_over_two {
    id 1
    type replicated
    step take default
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}

Tests with `crushtool` gave me confidence that it worked as I expected: unlike the default rule, this one always made sure at least one copy was on a different host.
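
In case it helps anyone reading along, a crushtool test along these lines should verify the rule offline (file names are placeholders):

  # grab and decompile the current CRUSH map
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # ... add the rule above to crushmap.txt, then recompile ...
  crushtool -c crushmap.txt -o crushmap.new
  # simulate placements for rule id 1 with 3 replicas
  crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-mappings
  # report any mappings that come up short of the requested replica count
  crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-bad-mappings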

When I applied this rule to an existing pool, for a moment I got a HEALTH_WARN "Reduced data availability: 6 pgs peering", with the PGs shown as inactive (the cluster then proceeded to heal itself). This somewhat surprised me. I have read threads like [1], where it seems even larger remappings go down without a hitch. I changed back to the default rule, this time with `osd set norebalance` set, but the outcome was the same (brief reduced availability, then healed).
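
For completeness, the steps involved are essentially just these (pool name is a placeholder):

  # optionally hold off data movement while switching rules
  ceph osd set norebalance
  # assign the new CRUSH rule to the pool
  ceph osd pool set testpool crush_rule replicated_three_over_two
  # watch peering / remapping
  ceph -s
  # allow rebalancing again once things look sane
  ceph osd unset norebalance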

Having read more about peering, I now assume that a very brief interruption due to peering is likely unavoidable?

But, more importantly, the length of the interruption probably does not scale with the size of the pool (I don't have much data in there right now). So changing the rule for a big pool would still mean only a brief interruption, is that correct?

Trying to wrap my head around what exactly the lower-level implications of changing the CRUSH rule could be, I remembered some forum thread I had read, but initially dismissed as somewhat incoherent. Someone claimed that one should set the pool's min_size to 1 when changing rules. For science, I gave this a try. Interestingly enough, the warning about reduced data availability didn't happen (even though there were again inactive/peering PGs).
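
Concretely, that experiment boils down to something like this (pool name is a placeholder; min_size should of course be set back afterwards):

  ceph osd pool set testpool min_size 1
  ceph osd pool set testpool crush_rule replicated_three_over_two
  # once the pool is healthy again
  ceph osd pool set testpool min_size 2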

I arrived at the following mental model of what happens on a CRUSH rule update: the lead OSD remains steady, but the placement of the two replicas is re-calculated. In case they need to be moved, peering needs to happen, briefly blocking. But, in case of min_size = 1, the lead OSD, I don't know, says YOLO and accepts writes without having finished peering? If so, why would the PG be considered inactive, though?
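
For anyone wanting to observe this, the up set (the CRUSH-computed target) and the acting set (the OSDs currently serving I/O) of a single PG can be compared while the change is in flight (PG id is a placeholder):

  # short form: prints the osdmap epoch plus the up and acting sets
  ceph pg map 1.0
  # full peering state, including past intervals and the recovery state machine
  ceph pg 1.0 query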

And, out of curiosity: besides the obvious possibility of corrupting your only copy of an object, are there any implications to setting min_size to 1 before changing a rule? Like, could a write in the right moment cause the peering process to permanently fail or something?

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UUXT4EBPZVO5T6WY3OS6NAKHRXDGQBSD/

Thanks in advance for any enlightenment :)
Conrad
_______________________________________________

I will not comment on the specific corner case where you set osd_crush_chooseleaf_type = 0 (osd) and have only 2 nodes, as there could be exceptions I am not aware of. But if I understand you correctly, you are interested in the general idea of how Ceph handles changes to CRUSH rules, how peering is involved, etc. So I will comment on the general case (no osd_crush_chooseleaf_type changes, cluster with 3+ nodes).

Assume you have a PG in a pool with failure domain host, mapped to OSDs A, B, C and in state active+clean. Then you change the rule to failure domain rack: the monitors will issue a new epoch (version) of the OSD map which maps the PG to OSDs D, E, F.
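
As a side note, the new mapping and the epoch it appeared in can be checked directly, e.g. for a hypothetical PG 1.0:

  # the monitors publish the change as a new osdmap epoch
  ceph osd dump | head -n 1
  # the PG's new target mapping (up set) can be read straight from the map
  ceph pg map 1.0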

To my understanding, once this mapping is computed, it really does not matter how it came to be: whether from a CRUSH failure domain change (our case), from OSDs or hosts being added or removed, or from an upmap override that bypasses CRUSH.
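
(For example, the same kind of remapping can be produced by hand with an upmap entry; the PG id and OSD ids below are made up, and upmap requires luminous+ clients:)

  # move the replica of PG 1.0 currently on osd.4 to osd.7, bypassing CRUSH
  ceph osd pg-upmap-items 1.0 4 7
  # remove the override again
  ceph osd rm-pg-upmap-items 1.0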

OSD D will then get this mapping and realize it is the primary for this PG. It will look at past epochs and see that the previous epoch had A, B, C in active+clean state, so it will initiate peering with A, B, C, E, F (had the previous state not been active+clean, more history and more OSDs might be involved). D realizes that, temporarily, it should make A a temporary primary (this may involve another epoch change, I am not sure). The agreement (peering result) among these OSDs will have D, E, F as the Up set and A, B, C as the Acting set, with A as the temporary acting primary.
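
(While this is going on, the temporary arrangement is visible from the outside: the override is recorded as a pg_temp entry in the osdmap, and affected PGs show different up and acting sets, typically with "remapped" in their state:)

  # temporary acting-set overrides show up as pg_temp entries
  ceph osd dump | grep pg_temp
  # PGs whose up set differs from their acting set
  ceph pg dump pgs_brief | grep remapped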

After the peering step, the PG will be active and serving I/O using A, B, C. A will then initiate the backfill operation to D, E, F. Once backfilling completes, a new epoch is reached where D becomes the primary and D, E, F are both the Up set and the Acting set. A, B, C will no longer be involved with the PG, and D will instruct them to delete their copies since the PG is now active+clean.
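
(Backfill progress itself can be followed with the usual tools:)

  # overall recovery/backfill progress and client I/O
  ceph -s
  # per-PG view; backfilling/backfill_wait states show which PGs are affected
  ceph pg dump pgs_brief | grep backfill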

In the general case you should not observe peering being stuck for a long time or inactive PGs. If, however, there are incorrect custom backfill settings that make the backfill load/traffic too high for the hardware being used, then, given that such CRUSH changes can require a large portion of your data to be moved, the stress can cause OSD heartbeats and other traffic to time out, and you could see inactive and stuck PGs. But again, this would be due to incorrect settings.
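
(For reference, the settings usually involved in throttling backfill are these; the values below are just conservative examples, and defaults and behavior differ between releases:)

  # limit concurrent backfills per OSD
  ceph config set osd osd_max_backfills 1
  # limit concurrent recovery ops per OSD
  ceph config set osd osd_recovery_max_active 1
  # add a small sleep between recovery ops on HDDs
  ceph config set osd osd_recovery_sleep_hdd 0.1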

/maged
