> Unfortunately, it doesn't really help answering my questions either.
Sometimes the best we can do is grunt and shrug :-/. Before Nautilus we couldn't merge PGs, so pg_num could be raised for a pool but not decreased, and a certain fear of overshooting became established. Mark is the go-to here.

> That's why deploying multiple OSDs per SSD is such a great way to improve performance on devices where 4K random IO throughput scales with iodepth.

Mark's testing has shown this to be much less the case with recent releases. Do you still see this? Until recently I was expecting 30 TB TLC SSDs for RBD, and within the next year perhaps as large as 122 TB for object, so I was thinking of splitting just because of the size, and the systems in question were over-equipped with CPU anyway.

> Memory: I have never used file store, so can't relate to that.

With Filestore on XFS I experienced a lot of ballooning, to the point of OOM-killing. In mixed clusters under duress the BlueStore OSDs consistently behaved better.

> 9000 PGs/OSD was too much for what kind of system? What CPU? How much RAM? How many OSDs per host?

Those were Cisco UCS… C240 M3. Dual 16c Sandy Bridge IIRC, 10x SATA HDD OSDs @ 3 TB, 64 GB RAM I think.

> Did it even work with 200 PGs with the same data (recovery after power loss)?

I didn't have remote power control, and being a shared lab it was difficult to take the cluster down for such testing. We did have a larger integration cluster (450 OSDs) with a PG ratio of ~200 where we tested a rack power drop. Ceph was fine (this was… Firefly, I think) but the LSI RoC HBAs lost data like crazy due to hardware, firmware, and utility bugs.

> Was it maybe the death spiral (https://ceph-users.ceph.narkive.com/KAzvjjPc/explanation-for-ceph-osd-set-nodown-and-ceph-osd-cluster-snap) that prevented the cluster from coming up and not so much the PG count?

Not in this case, though I've seen a similar cascading issue in another context.

> Rumors: Yes, 1000 PGs/OSD on spinners without issues. I guess we are not talking about barely working home systems with lack of all sorts of resources here.

I'd be curious how such systems behave under duress. I've seen a cluster that had grown to the point where the mons had enough RAM to run but not to boot, so I did urgent RAM upgrades on the mons. That was the mixed Filestore / BlueStore cluster (Luminous 12.2.2) where the Filestore OSDs were much more affected by a cascading event than the [mostly larger] BlueStore OSDs. I suspect that had the whole cluster been BlueStore it might not have cascaded.

> The goal: Let's say I want to go 500-1000 PGs/OSD on 16T spinners to trim PGs to about 10-20G each. What are the resources that count will require compared with, say, 200 PGs/OSD? That's the interesting question and if I can make the resources available I would consider doing that.

The proof is in the proverbial pudding. Bump up pg_num on pools and see how the average / P90 ceph-osd process size changes? Grafana FTW. osd_map_cache_size I think defaults to 50 now; I want to say it used to be much higher. A rough sketch of the sizing arithmetic is below.
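For the sizing part, here's a minimal back-of-envelope sketch in Python. Nothing official, just the arithmetic implied above: pick a target PG size, see what PG ratio and pool pg_num that implies. The figures in the example (16T spinners ~75% full, 3x replication, 500 OSDs, 15 GiB target PGs) are placeholders, not measurements.

    # Back-of-envelope only: given how full each OSD is, a target PG size,
    # the OSD count, and the replica count, what PG ratio and pool pg_num fall out?
    def pg_plan(used_per_osd_bytes, target_pg_bytes, num_osds, replicas):
        """Return (implied PG replicas per OSD, pool pg_num rounded up to a power of two)."""
        ratio = used_per_osd_bytes / target_pg_bytes      # PG replicas each OSD would hold
        raw_pg_num = ratio * num_osds / replicas          # logical PGs across the pool
        pg_num = 1 << (max(1, int(raw_pg_num)) - 1).bit_length()  # round up to a power of two
        return ratio, pg_num

    # Placeholder example: 16T spinners ~75% full, 15 GiB target PGs, 500 OSDs, 3x replication
    ratio, pg_num = pg_plan(12 * 2**40, 15 * 2**30, 500, 3)
    print(round(ratio), pg_num)   # -> 819 262144

The RAM side you'd still verify empirically, e.g. by watching ceph-osd RSS in Grafana while stepping pg_num up.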
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <a...@dreamsnake.net>
> Sent: Wednesday, October 9, 2024 2:40 AM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] What is the problem with many PGs per OSD
>
> I've sprinkled minimizers below. Free advice and worth every penny. ymmv. Do not taunt Happy Fun Ball.
>
>> During a lot of discussions in the past the comment that having "many PGs per OSD can lead to issues" came up without ever explaining what these issues will (not might!) be or how one would notice. It comes up as kind of a rumor without any factual or even anecdotal backing.
>
> A handful of years ago Sage IIRC retconned PG ratio guidance from 200 to 100 to help avoid OOMing, the idea being that more PGs = more RAM usage on each daemon that stores the maps. With BlueStore's osd_memory_target, my sense is that the ballooning seen with Filestore is much less of an issue.
>
>> As far as I can tell from experience, any increase of resource utilization due to an increase of the PG count per OSD is more than offset by the performance impact of the reduced size of the PGs. Everything seems to benefit from smaller PGs: recovery, user IO, scrubbing.
>
> My understanding is that there is serialization in the PG code, and thus the PG ratio can be thought of as the degree of parallelism the OSD device can handle. SAS/SATA SSDs don't seek, so they can handle more than HDDs, and NVMe devices can handle more than SAS/SATA.
>
>> Yet, I'm holding back on an increase of PG count due to these rumors.
>
> My personal sense:
>
> HDD OSD: PG ratio 100-200
> SATA/SAS SSD OSD: 200-300
> NVMe SSD OSD: 300-400
>
> These are not empirical figures. ymmv.
>
>> My situation: I would like to split PGs on large HDDs. Currently, we have on average 135 PGs per OSD and I would like to go for something like 450.
>
> The good Mr. Nelson may have more precise advice, but my personal sense is that I wouldn't go higher than 200 on an HDD. If you were at like 20 (I've seen it!) that would be a different story; my sense is that there are diminishing returns over, say, 150. Seek thrashing fu, elevator scheduling fu, op re-ordering fu, etc. Assuming you're on Nautilus or later, it doesn't hurt to experiment with your actual workload, since you can scale pg_num back down. Without Filestore colocated journals, the seek thrashing may be less of an issue than it used to be.
>
>> I heard in related rumors that some users have 1000+ PGs per OSD without problems.
>
> On spinners? Or NVMe? On a 60-120 TB NVMe OSD I'd be sorely tempted to try 500-1000.
>
>> I would be very much interested in a non-rumor answer, that is, not an answer of the form "it might use more RAM", "it might stress xyz". I don't care what a rumor says it might do. I would like to know what it will do.
>
> It WILL use more RAM.
>
>> I'm looking for answers of the form "a PG per OSD requires X amount of RAM fixed plus Y amount per object"
>
> Derive the size of your map and multiply by the number of OSDs per system. My sense is that it's on the order of MBs per OSD. After a certain point, the RAM delta might have more impact spent on raising osd_memory_target instead.
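To put a rough number on that derive-and-multiply suggestion, a minimal sketch; the map size and per-host OSD count are placeholders you'd replace with your own measurements (e.g. the size of a map saved with ceph osd getmap -o /tmp/osdmap), and it only covers cached maps, not per-PG state such as pg logs.

    # Rough per-host RAM attributable to cached OSD maps. Placeholder inputs;
    # measure your own map size and substitute. Ignores per-PG structures.
    def map_ram_estimate(osdmap_bytes, cached_maps, osds_per_host):
        per_daemon = osdmap_bytes * cached_maps       # each OSD caches a window of recent maps
        return per_daemon, per_daemon * osds_per_host

    # Example: 4 MiB map, osd_map_cache_size = 50, 24 OSDs per host
    per_daemon, per_host = map_ram_estimate(4 * 2**20, 50, 24)
    print(per_daemon // 2**20, "MiB per daemon,", per_host // 2**20, "MiB per host")
    # -> 200 MiB per daemon, 4800 MiB per host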
>> or "searching/indexing stuff of kind A in N PGs per OSD requires N log N / N² / ... operations", "peering of N PGs per OSD requires N / N log N / N² / N*#peers / ... operations". In other words, what are the *actual* resources required to host N PGs with M objects on an OSD (note that N*M is a constant per OSD). With that info one could make an informed decision, informed by facts not rumors.
>>
>> An additional question of interest is: Has anyone ever observed any detrimental effects of increasing the PG count per OSD to large values (>500)?
>
> Consider this scenario:
>
> An unmanaged lab setup used for successive OpenStack deployments, each of which created two RBD pools and the panoply of RGW pools. Nobody cleaned these up before redeploys, so they accreted like plaque in the arteries of an omnivore. Such that the PG ratio hits 9000. Yes, 9000. Then the building loses power. The systems don't have nearly enough RAM to boot, peer, and activate, so the entire cluster has to be wiped and redeployed from scratch. An extreme example, but remember that I don't make stuff up.
>
>> Thanks a lot for any clarifications in this matter!
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io