On 30/04/2025 00:14, Conrad Hoffmann wrote:

Hello,

I am trying to get a better understanding of how Ceph handles changes to CRUSH rules, such as changing the failure domain. I performed this (maybe somewhat academic, sorry) exercise, and would love to verify my conclusions (or get a good explanation of my observations):

Starting point:
  - 2 Nodes
  - 6 OSDs each
  - osd_crush_chooseleaf_type = 0 (OSD)
  - pool size = 3
  - pool min size = 2

As expected, the default CRUSH rule would distribute each PG across three random OSDs, without regard for the host they are on.
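
For reference, this is easy to check by listing the PG mappings; something like the following should show the up/acting OSD sets per PG (the pool name "testpool" and the PG id "1.0" are just placeholders):

  # list all PGs of the pool together with their up/acting OSD sets
  ceph pg ls-by-pool testpool
  # or look at a single PG's mapping
  ceph pg map 1.0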

I then added the following CRUSH rule (initially unused):

rule replicated_three_over_two {
    id 1
    type replicated
    step take default
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}

Tests with `crushtool` gave me confidence that it worked as I expected: unlike the default rule, this one always made sure at least one copy was on a different host.
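
In case it helps anyone reading along, a crushtool test along these lines should verify the rule offline (file names are placeholders):

  # grab and decompile the current CRUSH map
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # ... add the rule above to crushmap.txt, then recompile ...
  crushtool -c crushmap.txt -o crushmap.new
  # simulate placements for rule id 1 with 3 replicas
  crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-mappings
  # report any mappings that come up short of the requested replica count
  crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-bad-mappings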

When I applied this rule to an existing pool, for a moment I got a HEALTH_WARN "Reduced data availability: 6 pgs peering", with the PGs shown as inactive (the cluster then proceeded to heal itself). This somewhat surprised me. I have read threads like [1], where it seems even larger remappings go down without a hitch. I changed back to the default rule, this time with `osd set norebalance` set, but the outcome was the same (brief reduced availability, then healed).
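
For completeness, the steps involved are essentially just these (pool name is a placeholder):

  # optionally hold off data movement while switching rules
  ceph osd set norebalance
  # assign the new CRUSH rule to the pool
  ceph osd pool set testpool crush_rule replicated_three_over_two
  # watch peering / remapping
  ceph -s
  # allow rebalancing again once things look sane
  ceph osd unset norebalance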

Having read more about peering, I now assume that a very brief interruption due to peering is likely unavoidable?

But, more importantly, the length of the interruption probably does not scale with the size of the pool (I don't have much data in there right now). So changing the rule for a big pool would still mean only a brief interruption, is that correct?

Trying to wrap my head around what exactly the lower-level implications of changing the CRUSH rule could be, I remembered some forum thread I had read, but initially dismissed as somewhat incoherent. Someone claimed that one should set the pool's min_size to 1 when changing rules. For science, I gave this a try. Interestingly enough, the warning about reduced data availability didn't happen (even though there were again inactive/peering PGs).
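
Concretely, that experiment boils down to something like this (pool name is a placeholder; min_size should of course be set back afterwards):

  ceph osd pool set testpool min_size 1
  ceph osd pool set testpool crush_rule replicated_three_over_two
  # once the pool is healthy again
  ceph osd pool set testpool min_size 2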

I arrived at the following mental model of what happens on a CRUSH rule update: the lead OSD remains steady, but the placement of the two replicas is re-calculated. In case they need to be moved, peering needs to happen, briefly blocking. But, in case of min_size = 1, the lead OSD, I don't know, says YOLO and accepts writes without having finished peering? If so, why would the PG be considered inactive, though?
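
For anyone wanting to observe this, the up set (the CRUSH-computed target) and the acting set (the OSDs currently serving I/O) of a single PG can be compared while the change is in flight (PG id is a placeholder):

  # short form: prints the osdmap epoch plus the up and acting sets
  ceph pg map 1.0
  # full peering state, including past intervals and the recovery state machine
  ceph pg 1.0 query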

And, out of curiosity: besides the obvious possibility of corrupting your only copy of an object, are there any implications to setting min_size to 1 before changing a rule? Like, could a write in the right moment cause the peering process to permanently fail or something?

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UUXT4EBPZVO5T6WY3OS6NAKHRXDGQBSD/

Thanks in advance for any enlightenment :)
Conrad
_______________________________________________

I will not comment on the specific corner case where you set osd_crush_chooseleaf_type = 0 (osd) and have only 2 nodes, as there could be exceptions I am not aware of. But if I understand you correctly, you are interested in the general idea of how Ceph handles changes to CRUSH rules, how peering is involved, etc. So I will comment on the general case (no osd_crush_chooseleaf_type changes, cluster with 3+ nodes).

Assume you have a PG in a pool with failure domain host, mapped to OSDs A, B, C and in state active+clean. Then you change the rule to failure domain rack: the monitors will issue a new epoch (version) of the OSD map which maps the PG to OSDs D, E, F.
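
As a side note, the new mapping and the epoch it appeared in can be checked directly, e.g. for a hypothetical PG 1.0:

  # the monitors publish the change as a new osdmap epoch
  ceph osd dump | head -n 1
  # the PG's new target mapping (up set) can be read straight from the map
  ceph pg map 1.0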

To my understanding, once this mapping is computed, it really does not matter how it came to be: whether from a CRUSH failure domain change (our case), from OSDs or hosts being added or removed, or from an upmap override that bypasses CRUSH.
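
(For example, the same kind of remapping can be produced by hand with an upmap entry; the PG id and OSD ids below are made up, and upmap requires luminous+ clients:)

  # move the replica of PG 1.0 currently on osd.4 to osd.7, bypassing CRUSH
  ceph osd pg-upmap-items 1.0 4 7
  # remove the override again
  ceph osd rm-pg-upmap-items 1.0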

OSD D will then get this mapping and realize it is the primary for this PG. It will look at past epochs and see that the previous epoch had A, B, C in active+clean state, so it will initiate peering with A, B, C, E, F (had the previous state not been active+clean, more history and more OSDs might be involved). D realizes that, temporarily, it should make A a temporary primary (this may involve another epoch change, I am not sure). The agreement (peering result) among these OSDs will have D, E, F as the Up set and A, B, C as the Acting set, with A as the temporary acting primary.
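
(While this is going on, the temporary arrangement is visible from the outside: the override is recorded as a pg_temp entry in the osdmap, and affected PGs show different up and acting sets, typically with "remapped" in their state:)

  # temporary acting-set overrides show up as pg_temp entries
  ceph osd dump | grep pg_temp
  # PGs whose up set differs from their acting set
  ceph pg dump pgs_brief | grep remapped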

After the peering step, the PG will be active and serving I/O using A, B, C. A will then initiate the backfill operation to D, E, F. Once backfilling completes, a new epoch is reached where D becomes the primary and D, E, F are both the Up set and the Acting set. A, B, C will no longer be involved with the PG, and D will instruct them to delete their copies since the PG is now active+clean.
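
(Backfill progress itself can be followed with the usual tools:)

  # overall recovery/backfill progress and client I/O
  ceph -s
  # per-PG view; backfilling/backfill_wait states show which PGs are affected
  ceph pg dump pgs_brief | grep backfill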

In the general case you should not observe peering being stuck for a long time or inactive PGs. If, however, there are incorrect custom backfill settings that make the backfill load/traffic too high for the hardware being used, then, given that such CRUSH changes can require a large portion of your data to be moved, the stress can cause OSD heartbeats and other traffic to time out, and you could see inactive and stuck PGs. But again, this would be due to incorrect settings.
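
(For reference, the settings usually involved in throttling backfill are these; the values below are just conservative examples, and defaults and behavior differ between releases:)

  # limit concurrent backfills per OSD
  ceph config set osd osd_max_backfills 1
  # limit concurrent recovery ops per OSD
  ceph config set osd osd_recovery_max_active 1
  # add a small sleep between recovery ops on HDDs
  ceph config set osd osd_recovery_sleep_hdd 0.1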

/maged
