> Hi Anthony.
>
>> ... Bump up pg_num on pools and see how the average / P90 ceph-osd process
>> size changes?
>> Grafana FTW. osd_map_cache_size I think defaults to 50 now; I want to say
>> it used to be much higher.
>
> That's not an option. What would help is a-priori information based on the
> implementation.
I think with so many variables in play that would be tough to quantify.

> I'm looking at a pool with 5PB of data and 8192 PGs. If I increase that by,
> say, a factor of 4, it's in one step, not gradually, to avoid excessive
> redundant data movement. I don't want to spend hardware life for nothing and
> also don't want to wait for months or more for this to complete or get stuck
> along the way due to something catastrophic.

Used to be that Ceph wouldn't let you more than double pg_num in one step. You might consider going to just, say, 9216 and see what happens. A non-power-of-2 pg_num isn't THAT big a deal these days; you'll end up with some PGs larger than others, but it's not horrible for the short term. My sense re hardware life is that writes due to rebalancing are trivial.

> What I would like to know is: is there a fundamental scaling limit in the PG
> implementation that someone who was staring at the code for a long time knows
> about? This is usually something that grows much worse than N log N in time
> or memory complexity. The answer to this is in the code and boils down to
> "why the recommendation of 100 PGs per OSD" and not 200 or 1000 or 100 per TB
> (the latter would make a looooot more sense). There ought to be a reason
> other than "we didn't know what else to write".
>
> I would like to know the scaling in (worst-case) complexity as a function of
> the number of PGs. Making a fixed recommendation of a specific number,
> independent of anything else, is really weird. It indicates that there is
> something catastrophic in the code that will blow up once an
> (unknown/undocumented!!) threshold is crossed. For example, a tiny but
> important function that is exponential in the number of PGs. If there is
> nothing catastrophic in the code, then why is the recommendation not
> floating, specifying what increase in resource consumption one should expect?
>
> None of the discussions I have seen so far address this extreme weirdness of
> the recommendation. If there is an unsolved scaling problem, please, anyone,
> state what it is, why it's there and what the critical threshold is. What
> part of the code will explode?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <anthony.da...@gmail.com>
> Sent: Wednesday, October 9, 2024 3:52 PM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] What is the problem with many PGs per OSD
>
>> Unfortunately, it doesn't really help answering my questions either.
>
> Sometimes the best we can do is grunt and shrug :-/. Before Nautilus we
> couldn't merge PGs, so we could raise pg_num for a pool but not decrease it,
> and a certain fear of overshooting was established. Mark is the go-to here.
>
>> That's why deploying multiple OSDs per SSD is such a great way to improve
>> performance on devices where 4K random IO throughput scales with iodepth.
>
> Mark's testing has shown this to not be so much the case with recent
> releases — do you still see this? Until recently I was expecting 30TB TLC
> SSDs for RBD, and in the next year perhaps as large as 122T for object, so I
> was thinking of splitting just because of the size - and the systems in
> question were overequipped with CPU.
>
>> Memory: I have never used Filestore, so can't relate to that.
>
> XFS - I experienced a lot of ballooning, to the point of OOMkilling. In
> mixed clusters under duress the BlueStore OSDs consistently behaved better.
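Coming back to the 5 PB / 8192 PG pool for a moment, here is the back-of-the-envelope arithmetic in plain Python. The OSD count and replica count below are placeholders I made up, so plug in your own:

    # Rough per-PG sizing for one pool. The 5 PB and 8192 figures are from this
    # thread; the OSD count and replica count are placeholders.
    pool_bytes = 5 * 10**15        # ~5 PB of data in the pool
    replicas   = 3                 # assumed replication factor (use k+m for EC)
    num_osds   = 1000              # assumed number of OSDs backing the pool

    def per_pg_gb(data_bytes, pg_num):
        """Average amount of pool data mapped to a single PG, in GB."""
        return data_bytes / pg_num / 10**9

    def pgs_per_osd(pg_num, replicas, num_osds):
        """Average number of PG replicas each OSD carries for this pool alone."""
        return pg_num * replicas / num_osds

    for n in (8192, 9216, 16384, 32768):
        print(f"pg_num={n:6d}  ~{per_pg_gb(pool_bytes, n):4.0f} GB/PG"
              f"  ~{pgs_per_osd(n, replicas, num_osds):3.0f} PGs/OSD (this pool only)")

With those placeholder numbers a 4x split drops the average PG from roughly 600 GB to roughly 150 GB; what the cluster-wide PG ratio ends up at depends on what your other pools contribute.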
>
>> 9000 PGs/OSD was too much for what kind of system? What CPU? How much RAM?
>> How many OSDs per host?
>
> Those were Cisco UCS… C240 M3. Dual 16c Sandy Bridge IIRC, 10x SATA HDD OSDs
> @ 3TB, 64GB I think.
>
>> Did it even work with 200 PGs with the same data (recovery after power loss)?
>
> I didn't have remote power control, and being a shared lab it was difficult
> to take a cluster down for such testing. We did have a larger integration
> cluster (450 OSDs) with a PG ratio of ~200 where we tested a rack power
> drop. Ceph was fine (this was … Firefly I think) but the LSI RoC HBAs lost
> data like crazy due to hardware, firmware, and utility bugs.
>
>> Was it maybe the death spiral
>> (https://ceph-users.ceph.narkive.com/KAzvjjPc/explanation-for-ceph-osd-set-nodown-and-ceph-osd-cluster-snap)
>> that prevented the cluster from coming up and not so much the PG count?
>
> Not in this case, though I've seen a similar cascading issue in another
> context.
>
>> Rumors: Yes, 1000 PGs/OSD on spinners without issues. I guess we are not
>> talking about barely working home systems with lack of all sorts of
>> resources here.
>
> I'd be curious how such systems behave under duress. I've seen a cluster
> that had grown such that the mons ended up with enough RAM to run but not to
> boot, so I did urgent RAM upgrades on the mons. That was the mixed Filestore /
> BlueStore cluster (Luminous 12.2.2) where the Filestore OSDs were much more
> affected by a cascading event than the [mostly larger] BlueStore OSDs. I
> suspect that had the whole cluster been BlueStore it might not have cascaded.
>
>> The goal: Let's say I want to go 500-1000 PGs/OSD on 16T spinners to trim
>> PGs to about 10-20G each. What resources will that count require compared
>> with, say, 200 PGs/OSD? That's the interesting question, and if I can make
>> the resources available I would consider doing that.
>
> The proof is in the proverbial pudding. Bump up pg_num on pools and see how
> the average / P90 ceph-osd process size changes? Grafana FTW.
> osd_map_cache_size I think defaults to 50 now; I want to say it used to be
> much higher.
>
>>
>> Thanks and best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Anthony D'Atri <a...@dreamsnake.net>
>> Sent: Wednesday, October 9, 2024 2:40 AM
>> To: Frank Schilder
>> Cc: ceph-users@ceph.io
>> Subject: Re: [ceph-users] What is the problem with many PGs per OSD
>>
>> I've sprinkled minimizers below. Free advice and worth every penny. ymmv.
>> Do not taunt Happy Fun Ball.
>>
>>> during a lot of discussions in the past the comment that having "many PGs
>>> per OSD can lead to issues" came up without ever explaining what these
>>> issues will (not might!) be or how one would notice. It comes up as kind of
>>> a rumor without any factual or even anecdotal backing.
>>
>> A handful of years ago Sage IIRC retconned the PG ratio guidance from 200 to
>> 100 to help avoid OOMing, the idea being that more PGs = more RAM usage on
>> each daemon that stores the maps. With BlueStore's osd_memory_target, my
>> sense is that the ballooning seen with Filestore is much less of an issue.
>>
>>> As far as I can tell from experience, any increase of resource utilization
>>> due to an increase of the PG count per OSD is more than offset by the
>>> performance benefit of the reduced size of the PGs. Everything seems to
>>> benefit from smaller PGs: recovery, user IO, scrubbing.
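On the 500-1000 PGs/OSD goal for the 16T spinners above, the same sort of napkin math gives a feel for what PG sizes each ratio buys you. The fill level here is an assumption:

    # How small do PGs get at a given PG ratio on a 16 TB spinner, and how many
    # PGs per OSD would a 10-20 GB target need? The fill level is an assumption.
    osd_capacity_gb = 16 * 1000
    fill_ratio      = 0.70                      # assumed average OSD utilisation
    used_gb         = osd_capacity_gb * fill_ratio

    for ratio in (100, 200, 500, 1000):
        print(f"{ratio:5d} PGs/OSD -> ~{used_gb / ratio:6.1f} GB per PG")

    for target_gb in (10, 20):
        print(f"~{target_gb} GB PGs -> ~{used_gb / target_gb:.0f} PGs/OSD")

At ~70% full that pencils out to roughly 11-22 GB per PG at a 500-1000 ratio, which lines up with the 10-20G target stated above.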
>>
>> My understanding is that there is serialization in the PG code, and thus the
>> PG ratio can be thought of as the degree of parallelism the OSD device can
>> handle. SAS/SATA SSDs don't seek so they can handle more than HDDs, and
>> NVMe devices can handle more than SAS/SATA.
>>
>>> Yet, I'm holding back on an increase of PG count due to these rumors.
>>
>> My personal sense:
>>
>> HDD OSD: PG ratio 100-200
>> SATA/SAS SSD OSD: 200-300
>> NVMe SSD OSD: 300-400
>>
>> These are not empirical figures. ymmv.
>>
>>> My situation: I would like to split PGs on large HDDs. Currently, we have
>>> on average 135 PGs per OSD and I would like to go for something like 450.
>>
>> The good Mr. Nelson may have more precise advice, but my personal sense is
>> that I wouldn't go higher than 200 on an HDD. If you were at like 20 (I've
>> seen it!) that would be a different story; my sense is that there are
>> diminishing returns over, say, 150. Seek thrashing fu, elevator scheduling
>> fu, op re-ordering fu, etc. Assuming you're on Nautilus or later, it
>> doesn't hurt to experiment with your actual workload since you can scale
>> pg_num back down. Without Filestore colocated journals, the seek thrashing
>> may be less of an issue than it used to be.
>>
>>> I heard in related rumors that some users have 1000+ PGs per OSD without
>>> problems.
>>
>> On spinners? Or NVMe? On a 60-120 TB NVMe OSD I'd be sorely tempted to try
>> 500-1000.
>>
>>> I would be very much interested in a non-rumor answer, that is, not an
>>> answer of the form "it might use more RAM", "it might stress xyz". I don't
>>> care what a rumor says it might do. I would like to know what it will do.
>>
>> It WILL use more RAM.
>>
>>> I'm looking for answers of the form "a PG per OSD requires X amount of RAM
>>> fixed plus Y amount per object"
>>
>> Derive the size of your map and multiply by the number of OSDs per system.
>> My sense is that it's on the order of MBs per OSD. After a certain point
>> that RAM might have more impact spent on raising osd_memory_target instead.
>>
>>> or "searching/indexing stuff of kind A in N PGs per OSD requires N log
>>> N/N²/... operations", "peering of N PGs per OSD requires N/N log
>>> N/N²/N*#peers/... operations". In other words, what are the *actual*
>>> resources required to host N PGs with M objects on an OSD (note that N*M is
>>> a constant per OSD). With that info one could make an informed decision,
>>> informed by facts not rumors.
>>>
>>> An additional question of interest is: Has anyone ever observed any
>>> detrimental effects of increasing the PG count per OSD to large values
>>> (>500)?
>>
>> Consider this scenario:
>>
>> An unmanaged lab setup used for successive OpenStack deployments, each of
>> which created two RBD pools and the panoply of RGW pools. Which nobody
>> cleaned up before redeploys, so they accreted like plaque in the arteries of
>> an omnivore. Such that the PG ratio hits 9000. Yes, 9000. Then the
>> building loses power. The systems don't have nearly enough RAM to boot,
>> peer, and activate, so the entire cluster has to be wiped and redeployed
>> from scratch. An extreme example, but remember that I don't make stuff up.
>>
>>> Thanks a lot for any clarifications in this matter!
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
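And for completeness, the rule-of-thumb arithmetic behind the per-device PG ratio targets quoted above, sketched as a small Python helper. This is just the familiar "OSDs x target ratio / replicas, rounded to a power of two" formula, not anything lifted from Ceph code, and it assumes one dominant pool; with several busy pools you'd split the ratio budget between them:

    # Turn a target PG ratio into a per-pool pg_num suggestion. This mirrors the
    # usual "OSDs * target ratio / replicas, rounded to a power of two" rule of
    # thumb; it is not taken from Ceph source, and it assumes one dominant pool.
    def suggest_pg_num(num_osds, target_ratio, replicas, round_pow2=True):
        raw = num_osds * target_ratio / replicas
        if not round_pow2:
            return round(raw)
        lower = 1 << (int(raw).bit_length() - 1)   # nearest power of two below
        upper = lower * 2                          # nearest power of two above
        return lower if raw - lower < upper - raw else upper

    # Example: 100 HDD OSDs, replica 3, at the ratios discussed above.
    for ratio in (100, 200, 300):
        print(f"target {ratio:3d} PGs/OSD -> pg_num {suggest_pg_num(100, ratio, 3)}")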