> 
> When I applied this rule to an existing pool, for a moment I got a 
> HEALTH_WARN "Reduced data availability: 6 pgs peering" and the PGs being 
> shown as inactive (cluster then proceeded to heal itself). This sort of 
> surprised me. I have read threads like [1], where it seems even larger 
> remappings go down without a hitch. I changed back to the default rule, this 
> time with `osd set norebalance` set. But the outcome was the same (brief 
> reduced availability, then healed).

My experience has been to disregard these ephemeral indications of inactive PGs 
while peering.  This is either illusory or so ephemeral that it doesn’t matter.
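If you want to confirm that an inactive indication really is ephemeral, a couple of read-only commands will show it (nothing here changes cluster state):

```shell
# List PGs currently in the peering state; during a rule change
# this should empty out within seconds
ceph pg ls peering

# Anything that stays stuck inactive is worth a closer look
ceph pg dump_stuck inactive
```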

> Having read more about peering, I now assume that a very brief interruption 
> due to peering is likely unavoidable?

Yeppers.

> But, more importantly, the length of the interruption probably does not scale 
> with the size of the pool (I don't have much data in there right now).

Most likely not, no.  The data exchanged during peering is not extensive 
compared to payload data.

> So changing the rule for a big pool would still mean only a brief 
> interruption, is that correct?

With respect to peering, yes.  But if that rule change will cause a lot of data 
to move around, say you change the failure domain or device class or something 
like size=4, you’ll have a bunch of backfill.  This deck

https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf
Mastering Ceph Operations with Upmap


shows how to use pg-upmap† to throttle such a thundering herd to limit client 
impact.  This is especially valuable with HDD OSDs, but also helps in other 
situations where something’s wrong and you want to stop backfilling NOW and get 
back to HEALTH_OK.
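The general shape of that approach (pool and rule names here are hypothetical, and it assumes a Luminous-or-later cluster and clients) is roughly:

```shell
# Require clients to understand upmap before using it
ceph osd set-require-min-compat-client luminous

# Pause data movement while the new rule is applied
ceph osd set norebalance
ceph osd set nobackfill
ceph osd pool set mypool crush_rule myrule   # hypothetical names

# Pin remapped PGs back onto their current OSDs with upmap entries,
# e.g. for a PG 1.2f that the new rule would move from osd.4 to osd.9:
ceph osd pg-upmap-items 1.2f 9 4
# (tools like pgremapper can generate these entries in bulk)

# Then remove the entries a few PGs at a time to drip-feed backfill
ceph osd rm-pg-upmap-items 1.2f
ceph osd unset nobackfill
ceph osd unset norebalance
```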




† if your cluster and all clients are at least Luminous

> Trying to wrap my head around what exactly the lower-level implications of 
> changing the CRUSH rule could be, I remembered some forum thread I had read, 
> but initially dismissed as somewhat incoherent.

Sysadmins being incoherent?  Nah, never happens.

> Someone claimed that one should set the pool's min_size to 1 when changing 
> rules.

https://44.media.tumblr.com/3e5f11f6a74fe641d85a42c054548681/tumblr_ox8itzF2fs1vbcnq8o1_500.gif

That would not serve Vaal.

You want min_size=1 in only very rare and specific circumstances.  If you 
don’t know EXACTLY what you’re doing, don’t ever set min_size=1.  I’ve done it 
maybe 3 times in my career, and reverted ASAP.

* Your data is truly disposable, like a scratchpad, or can be regenerated
* You’ve suffered overlapping failures and you need to get PGs active again, in 
which case one may briefly set min_size=1 but IMMEDIATELY raise it again once 
the PGs have recovered. 
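For that second case, the dance (pool name hypothetical) is roughly:

```shell
# Only during an emergency where PGs are stuck incomplete/inactive:
ceph osd pool set mypool min_size 1

# Watch the affected PGs recover...
ceph pg ls incomplete

# ...then restore the safe value IMMEDIATELY (2 for a size=3 pool)
ceph osd pool set mypool min_size 2
```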


> I arrived at the following mental model of what happens on a CRUSH rule 
> update: the lead OSD remains steady, but the placement of the two replicas is 
> re-calculated.

All OSD placements can change.  Say you are adding a device class constraint 
(as IMHO every rule should have): what would happen if the lead OSD were not 
of the desired device class?
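You can also see exactly which placements a rule change would alter before committing it, by testing the compiled map offline (the rule id and replica count below are just examples):

```shell
# Grab the current CRUSH map and show where rule 1 would place 3 replicas
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings

# Run the same test for your current rule and diff the two outputs
# to see which inputs would map to a different OSD set
```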

> In case they need to be moved, peering needs to happen, briefly blocking. 
> But, in case of min_size = 1, the lead OSD, I don't know, says YOLO and 
> accepts writes without having finished peering? If so, why would the PG be 
> considered inactive, though?

I will defer to RADOS gods for that detail, but I suspect that since each PG’s 
acting/up set is only 1 OSD, there’s no peering to be done.  Like Jesse Jackson 
said, the question is moot.

> And, out of curiosity: besides the obvious possibility of corrupting your 
> only copy of an object, are there any implications to setting min_size to 1 
> before changing a rule?

That’s also very possible with a size=2 pool, due to the nature of 
distributed systems.  There are certain sequences of [re]starts and crashes 
that may result in neither OSD clearly having the latest copy of the data.  I 
have seen this with my own eyes (after having warned the responsible parties).  
Data was lost.  ITYS.

People do this day in and day out without touching min_size.

> Like, could a write in the right moment cause the peering process to 
> permanently fail or something?

Or in the *wrong* moment ;)

If you change CRUSH rules or make other manual CRUSH edits, be sure to keep 
around a backup copy of the text CRUSH map in case you need to revert.
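Something like (filenames are just examples):

```shell
# Save a decompiled, human-readable copy before editing anything
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# To revert: recompile the saved text map and inject it
crushtool -c crush.txt -o crush-restore.bin
ceph osd setcrushmap -i crush-restore.bin
```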


> 
> [1] 
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/UUXT4EBPZVO5T6WY3OS6NAKHRXDGQBSD/
> 
> Thanks in advance for any enlightenment :)
> Conrad
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
