> Unfortunately, it doesn't really help answering my questions either.
Sometimes the best we can do is grunt and shrug :-/. Before Nautilus we couldn't merge PGs, so pg_num could be raised for a pool but not decreased, and a certain fear of overshooting became established. Mark is the go-to here.

> That's why deploying multiple OSDs per SSD is such a great way to improve performance on devices where 4K random IO throughput scales with iodepth.

Mark's testing has shown this to be much less the case with recent releases. Do you still see this? Until recently I was expecting 30 TB TLC SSDs for RBD, and within the next year perhaps as large as 122 TB for object, so I was thinking of splitting just because of the size, and the systems in question were over-equipped with CPU anyway.

> Memory: I have never used file store, so can't relate to that.

With Filestore on XFS I experienced a lot of ballooning, to the point of OOM-killing. In mixed clusters under duress the BlueStore OSDs consistently behaved better.

> 9000 PGs/OSD was too much for what kind of system? What CPU? How much RAM? How many OSDs per host?

Those were Cisco UCS… C240 M3. Dual 16c Sandy Bridge IIRC, 10x SATA HDD OSDs @ 3 TB, 64 GB RAM I think.

> Did it even work with 200 PGs with the same data (recovery after power loss)?

I didn't have remote power control, and being a shared lab it was difficult to take the cluster down for such testing. We did have a larger integration cluster (450 OSDs) with a PG ratio of ~200 where we tested a rack power drop. Ceph was fine (this was… Firefly, I think) but the LSI RoC HBAs lost data like crazy due to hardware, firmware, and utility bugs.

> Was it maybe the death spiral (https://ceph-users.ceph.narkive.com/KAzvjjPc/explanation-for-ceph-osd-set-nodown-and-ceph-osd-cluster-snap) that prevented the cluster from coming up and not so much the PG count?

Not in this case, though I've seen a similar cascading issue in another context.

> Rumors: Yes, 1000 PGs/OSD on spinners without issues. I guess we are not talking about barely working home systems with lack of all sorts of resources here.

I'd be curious how such systems behave under duress. I've seen a cluster that had grown to the point where the mons had enough RAM to run but not to boot, so I did urgent RAM upgrades on the mons. That was the mixed Filestore / BlueStore cluster (Luminous 12.2.2) where the Filestore OSDs were much more affected by a cascading event than the [mostly larger] BlueStore OSDs. I suspect that had the whole cluster been BlueStore it might not have cascaded.

> The goal: Let's say I want to go 500-1000 PGs/OSD on 16T spinners to trim PGs to about 10-20G each. What are the resources that count will require compared with, say, 200 PGs/OSD? That's the interesting question and if I can make the resources available I would consider doing that.

The proof is in the proverbial pudding. Bump up pg_num on pools and see how the average / P90 ceph-osd process size changes? Grafana FTW. osd_map_cache_size I think defaults to 50 now; I want to say it used to be much higher. A rough sketch of the sizing arithmetic is below.
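For the sizing part, here's a minimal back-of-envelope sketch in Python. Nothing official, just the arithmetic implied above: pick a target PG size, see what PG ratio and pool pg_num that implies. The figures in the example (16T spinners ~75% full, 3x replication, 500 OSDs, 15 GiB target PGs) are placeholders, not measurements.

    # Back-of-envelope only: given how full each OSD is, a target PG size,
    # the OSD count, and the replica count, what PG ratio and pool pg_num fall out?
    def pg_plan(used_per_osd_bytes, target_pg_bytes, num_osds, replicas):
        """Return (implied PG replicas per OSD, pool pg_num rounded up to a power of two)."""
        ratio = used_per_osd_bytes / target_pg_bytes      # PG replicas each OSD would hold
        raw_pg_num = ratio * num_osds / replicas          # logical PGs across the pool
        pg_num = 1 << (max(1, int(raw_pg_num)) - 1).bit_length()  # round up to a power of two
        return ratio, pg_num

    # Placeholder example: 16T spinners ~75% full, 15 GiB target PGs, 500 OSDs, 3x replication
    ratio, pg_num = pg_plan(12 * 2**40, 15 * 2**30, 500, 3)
    print(round(ratio), pg_num)   # -> 819 262144

The RAM side you'd still verify empirically, e.g. by watching ceph-osd RSS in Grafana while stepping pg_num up.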
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <a...@dreamsnake.net>
> Sent: Wednesday, October 9, 2024 2:40 AM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] What is the problem with many PGs per OSD
>
> I've sprinkled minimizers below. Free advice and worth every penny. ymmv. Do not taunt Happy Fun Ball.
>
>> During a lot of discussions in the past the comment that having "many PGs per OSD can lead to issues" came up without ever explaining what these issues will (not might!) be or how one would notice. It comes up as kind of a rumor without any factual or even anecdotal backing.
>
> A handful of years ago Sage IIRC retconned PG ratio guidance from 200 to 100 to help avoid OOMing, the idea being that more PGs = more RAM usage on each daemon that stores the maps. With BlueStore's osd_memory_target, my sense is that the ballooning seen with Filestore is much less of an issue.
>
>> As far as I can tell from experience, any increase of resource utilization due to an increase of the PG count per OSD is more than offset by the performance impact of the reduced size of the PGs. Everything seems to benefit from smaller PGs: recovery, user IO, scrubbing.
>
> My understanding is that there is serialization in the PG code, and thus the PG ratio can be thought of as the degree of parallelism the OSD device can handle. SAS/SATA SSDs don't seek, so they can handle more than HDDs, and NVMe devices can handle more than SAS/SATA.
>
>> Yet, I'm holding back on an increase of PG count due to these rumors.
>
> My personal sense:
>
> HDD OSD: PG ratio 100-200
> SATA/SAS SSD OSD: 200-300
> NVMe SSD OSD: 300-400
>
> These are not empirical figures. ymmv.
>
>> My situation: I would like to split PGs on large HDDs. Currently, we have on average 135 PGs per OSD and I would like to go for something like 450.
>
> The good Mr. Nelson may have more precise advice, but my personal sense is that I wouldn't go higher than 200 on an HDD. If you were at like 20 (I've seen it!) that would be a different story; my sense is that there are diminishing returns over, say, 150. Seek thrashing fu, elevator scheduling fu, op re-ordering fu, etc. Assuming you're on Nautilus or later, it doesn't hurt to experiment with your actual workload, since you can scale pg_num back down. Without Filestore colocated journals, the seek thrashing may be less of an issue than it used to be.
>
>> I heard in related rumors that some users have 1000+ PGs per OSD without problems.
>
> On spinners? Or NVMe? On a 60-120 TB NVMe OSD I'd be sorely tempted to try 500-1000.
>
>> I would be very much interested in a non-rumor answer, that is, not an answer of the form "it might use more RAM", "it might stress xyz". I don't care what a rumor says it might do. I would like to know what it will do.
>
> It WILL use more RAM.
>
>> I'm looking for answers of the form "a PG per OSD requires X amount of RAM fixed plus Y amount per object"
>
> Derive the size of your map and multiply by the number of OSDs per system. My sense is that it's on the order of MBs per OSD. After a certain point, the RAM delta might have more impact spent on raising osd_memory_target instead.
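To put a rough number on that derive-and-multiply suggestion, a minimal sketch; the map size and per-host OSD count are placeholders you'd replace with your own measurements (e.g. the size of a map saved with ceph osd getmap -o /tmp/osdmap), and it only covers cached maps, not per-PG state such as pg logs.

    # Rough per-host RAM attributable to cached OSD maps. Placeholder inputs;
    # measure your own map size and substitute. Ignores per-PG structures.
    def map_ram_estimate(osdmap_bytes, cached_maps, osds_per_host):
        per_daemon = osdmap_bytes * cached_maps       # each OSD caches a window of recent maps
        return per_daemon, per_daemon * osds_per_host

    # Example: 4 MiB map, osd_map_cache_size = 50, 24 OSDs per host
    per_daemon, per_host = map_ram_estimate(4 * 2**20, 50, 24)
    print(per_daemon // 2**20, "MiB per daemon,", per_host // 2**20, "MiB per host")
    # -> 200 MiB per daemon, 4800 MiB per host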
>> or "searching/indexing stuff of kind A in N PGs per OSD requires N log N / N² / ... operations", "peering of N PGs per OSD requires N / N log N / N² / N*#peers / ... operations". In other words, what are the *actual* resources required to host N PGs with M objects on an OSD (note that N*M is a constant per OSD). With that info one could make an informed decision, informed by facts not rumors.
>>
>> An additional question of interest is: Has anyone ever observed any detrimental effects of increasing the PG count per OSD to large values (>500)?
>
> Consider this scenario:
>
> An unmanaged lab setup used for successive OpenStack deployments, each of which created two RBD pools and the panoply of RGW pools. Nobody cleaned these up before redeploys, so they accreted like plaque in the arteries of an omnivore. Such that the PG ratio hits 9000. Yes, 9000. Then the building loses power. The systems don't have nearly enough RAM to boot, peer, and activate, so the entire cluster has to be wiped and redeployed from scratch. An extreme example, but remember that I don't make stuff up.
>
>> Thanks a lot for any clarifications in this matter!
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io