> Hi Anthony.
>
>> ... Bump up pg_num on pools and see how the average / P90 ceph-osd process
>> size changes?
>> Grafana FTW. osd_map_cache_size I think defaults to 50 now; I want to say
>> it used to be much higher.
>
> That's not an option. What would help is a-priori information based on the
> implementation.
I think with so many variables in play that would be tough to quantify.

> I'm looking at a pool with 5PB of data and 8192 PGs. If I increase that by,
> say, a factor of 4, it's in one step, not gradually, to avoid excessive
> redundant data movement. I don't want to spend hardware life for nothing and
> also don't want to wait for months or more for this to complete or get stuck
> along the way due to something catastrophic.

Used to be that Ceph wouldn't let you more than double pg_num in one step. You might consider going to just, say, 9216 and see what happens. A non-power-of-2 pg_num isn't THAT big a deal these days; you'll end up with some PGs larger than others, but it's not horrible for the short term. My sense re hardware life is that writes due to rebalancing are trivial.

> What I would like to know is: is there a fundamental scaling limit in the PG
> implementation that someone who was staring at the code for a long time knows
> about? This is usually something that grows much worse than N log N in time
> or memory complexity. The answer to this is in the code and boils down to
> "why the recommendation of 100 PGs per OSD" and not 200 or 1000 or 100 per TB
> (the latter would make a looooot more sense). There ought to be a reason
> other than "we didn't know what else to write".
>
> I would like to know the scaling in (worst-case) complexity as a function of
> the number of PGs. Making a fixed recommendation of a specific number,
> independent of anything else, is really weird. It indicates that there is
> something catastrophic in the code that will blow up once an
> (unknown/undocumented!!) threshold is crossed. For example, a tiny but
> important function that is exponential in the number of PGs. If there is
> nothing catastrophic in the code, then why is the recommendation not
> floating, specifying what increase in resource consumption one should expect?
>
> None of the discussions I have seen so far address this extreme weirdness of
> the recommendation. If there is an unsolved scaling problem, please, anyone,
> state what it is, why it's there and what the critical threshold is. What
> part of the code will explode?
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <anthony.da...@gmail.com>
> Sent: Wednesday, October 9, 2024 3:52 PM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] What is the problem with many PGs per OSD
>
>> Unfortunately, it doesn't really help answering my questions either.
>
> Sometimes the best we can do is grunt and shrug :-/. Before Nautilus we
> couldn't merge PGs, so we could raise pg_num for a pool but not decrease it,
> and a certain fear of overshooting was established. Mark is the go-to here.
>
>> That's why deploying multiple OSDs per SSD is such a great way to improve
>> performance on devices where 4K random IO throughput scales with iodepth.
>
> Mark's testing has shown this to not be so much the case with recent
> releases — do you still see this? Until recently I was expecting 30TB TLC
> SSDs for RBD, and in the next year perhaps as large as 122T for object, so I
> was thinking of splitting just because of the size - and the systems in
> question were overequipped with CPU.
>
>> Memory: I have never used Filestore, so can't relate to that.
>
> XFS - I experienced a lot of ballooning, to the point of OOMkilling. In
> mixed clusters under duress the BlueStore OSDs consistently behaved better.
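Coming back to the 5 PB / 8192 PG pool for a moment, here is the back-of-the-envelope arithmetic in plain Python. The OSD count and replica count below are placeholders I made up, so plug in your own:

    # Rough per-PG sizing for one pool. The 5 PB and 8192 figures are from this
    # thread; the OSD count and replica count are placeholders.
    pool_bytes = 5 * 10**15        # ~5 PB of data in the pool
    replicas   = 3                 # assumed replication factor (use k+m for EC)
    num_osds   = 1000              # assumed number of OSDs backing the pool

    def per_pg_gb(data_bytes, pg_num):
        """Average amount of pool data mapped to a single PG, in GB."""
        return data_bytes / pg_num / 10**9

    def pgs_per_osd(pg_num, replicas, num_osds):
        """Average number of PG replicas each OSD carries for this pool alone."""
        return pg_num * replicas / num_osds

    for n in (8192, 9216, 16384, 32768):
        print(f"pg_num={n:6d}  ~{per_pg_gb(pool_bytes, n):4.0f} GB/PG"
              f"  ~{pgs_per_osd(n, replicas, num_osds):3.0f} PGs/OSD (this pool only)")

With those placeholder numbers a 4x split drops the average PG from roughly 600 GB to roughly 150 GB; what the cluster-wide PG ratio ends up at depends on what your other pools contribute.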
>
>> 9000 PGs/OSD was too much for what kind of system? What CPU? How much RAM?
>> How many OSDs per host?
>
> Those were Cisco UCS… C240 M3. Dual 16c Sandy Bridge IIRC, 10x SATA HDD OSDs
> @ 3TB, 64GB I think.
>
>> Did it even work with 200 PGs with the same data (recovery after power loss)?
>
> I didn't have remote power control, and being a shared lab it was difficult
> to take a cluster down for such testing. We did have a larger integration
> cluster (450 OSDs) with a PG ratio of ~200 where we tested a rack power
> drop. Ceph was fine (this was … Firefly I think) but the LSI RoC HBAs lost
> data like crazy due to hardware, firmware, and utility bugs.
>
>> Was it maybe the death spiral
>> (https://ceph-users.ceph.narkive.com/KAzvjjPc/explanation-for-ceph-osd-set-nodown-and-ceph-osd-cluster-snap)
>> that prevented the cluster from coming up and not so much the PG count?
>
> Not in this case, though I've seen a similar cascading issue in another
> context.
>
>> Rumors: Yes, 1000 PGs/OSD on spinners without issues. I guess we are not
>> talking about barely working home systems with lack of all sorts of
>> resources here.
>
> I'd be curious how such systems behave under duress. I've seen a cluster
> that had grown such that the mons ended up with enough RAM to run but not to
> boot, so I did urgent RAM upgrades on the mons. That was the mixed Filestore /
> BlueStore cluster (Luminous 12.2.2) where the Filestore OSDs were much more
> affected by a cascading event than the [mostly larger] BlueStore OSDs. I
> suspect that had the whole cluster been BlueStore it might not have cascaded.
>
>> The goal: Let's say I want to go 500-1000 PGs/OSD on 16T spinners to trim
>> PGs to about 10-20G each. What resources will that count require compared
>> with, say, 200 PGs/OSD? That's the interesting question, and if I can make
>> the resources available I would consider doing that.
>
> The proof is in the proverbial pudding. Bump up pg_num on pools and see how
> the average / P90 ceph-osd process size changes? Grafana FTW.
> osd_map_cache_size I think defaults to 50 now; I want to say it used to be
> much higher.
>
>>
>> Thanks and best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Anthony D'Atri <a...@dreamsnake.net>
>> Sent: Wednesday, October 9, 2024 2:40 AM
>> To: Frank Schilder
>> Cc: ceph-users@ceph.io
>> Subject: Re: [ceph-users] What is the problem with many PGs per OSD
>>
>> I've sprinkled minimizers below. Free advice and worth every penny. ymmv.
>> Do not taunt Happy Fun Ball.
>>
>>> during a lot of discussions in the past the comment that having "many PGs
>>> per OSD can lead to issues" came up without ever explaining what these
>>> issues will (not might!) be or how one would notice. It comes up as kind of
>>> a rumor without any factual or even anecdotal backing.
>>
>> A handful of years ago Sage IIRC retconned the PG ratio guidance from 200 to
>> 100 to help avoid OOMing, the idea being that more PGs = more RAM usage on
>> each daemon that stores the maps. With BlueStore's osd_memory_target, my
>> sense is that the ballooning seen with Filestore is much less of an issue.
>>
>>> As far as I can tell from experience, any increase of resource utilization
>>> due to an increase of the PG count per OSD is more than offset by the
>>> performance benefit of the reduced size of the PGs. Everything seems to
>>> benefit from smaller PGs: recovery, user IO, scrubbing.
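On the 500-1000 PGs/OSD goal for the 16T spinners above, the same sort of napkin math gives a feel for what PG sizes each ratio buys you. The fill level here is an assumption:

    # How small do PGs get at a given PG ratio on a 16 TB spinner, and how many
    # PGs per OSD would a 10-20 GB target need? The fill level is an assumption.
    osd_capacity_gb = 16 * 1000
    fill_ratio      = 0.70                      # assumed average OSD utilisation
    used_gb         = osd_capacity_gb * fill_ratio

    for ratio in (100, 200, 500, 1000):
        print(f"{ratio:5d} PGs/OSD -> ~{used_gb / ratio:6.1f} GB per PG")

    for target_gb in (10, 20):
        print(f"~{target_gb} GB PGs -> ~{used_gb / target_gb:.0f} PGs/OSD")

At ~70% full that pencils out to roughly 11-22 GB per PG at a 500-1000 ratio, which lines up with the 10-20G target stated above.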
>>
>> My understanding is that there is serialization in the PG code, and thus the
>> PG ratio can be thought of as the degree of parallelism the OSD device can
>> handle. SAS/SATA SSDs don't seek so they can handle more than HDDs, and
>> NVMe devices can handle more than SAS/SATA.
>>
>>> Yet, I'm holding back on an increase of PG count due to these rumors.
>>
>> My personal sense:
>>
>> HDD OSD: PG ratio 100-200
>> SATA/SAS SSD OSD: 200-300
>> NVMe SSD OSD: 300-400
>>
>> These are not empirical figures. ymmv.
>>
>>> My situation: I would like to split PGs on large HDDs. Currently, we have
>>> on average 135 PGs per OSD and I would like to go for something like 450.
>>
>> The good Mr. Nelson may have more precise advice, but my personal sense is
>> that I wouldn't go higher than 200 on an HDD. If you were at like 20 (I've
>> seen it!) that would be a different story; my sense is that there are
>> diminishing returns over, say, 150. Seek thrashing fu, elevator scheduling
>> fu, op re-ordering fu, etc. Assuming you're on Nautilus or later, it
>> doesn't hurt to experiment with your actual workload since you can scale
>> pg_num back down. Without Filestore colocated journals, the seek thrashing
>> may be less of an issue than it used to be.
>>
>>> I heard in related rumors that some users have 1000+ PGs per OSD without
>>> problems.
>>
>> On spinners? Or NVMe? On a 60-120 TB NVMe OSD I'd be sorely tempted to try
>> 500-1000.
>>
>>> I would be very much interested in a non-rumor answer, that is, not an
>>> answer of the form "it might use more RAM", "it might stress xyz". I don't
>>> care what a rumor says it might do. I would like to know what it will do.
>>
>> It WILL use more RAM.
>>
>>> I'm looking for answers of the form "a PG per OSD requires X amount of RAM
>>> fixed plus Y amount per object"
>>
>> Derive the size of your map and multiply by the number of OSDs per system.
>> My sense is that it's on the order of MBs per OSD. After a certain point
>> that RAM might have more impact spent on raising osd_memory_target instead.
>>
>>> or "searching/indexing stuff of kind A in N PGs per OSD requires N log
>>> N/N²/... operations", "peering of N PGs per OSD requires N/N log
>>> N/N²/N*#peers/... operations". In other words, what are the *actual*
>>> resources required to host N PGs with M objects on an OSD (note that N*M is
>>> a constant per OSD). With that info one could make an informed decision,
>>> informed by facts not rumors.
>>>
>>> An additional question of interest is: Has anyone ever observed any
>>> detrimental effects of increasing the PG count per OSD to large values
>>> (>500)?
>>
>> Consider this scenario:
>>
>> An unmanaged lab setup used for successive OpenStack deployments, each of
>> which created two RBD pools and the panoply of RGW pools. Which nobody
>> cleaned up before redeploys, so they accreted like plaque in the arteries of
>> an omnivore. Such that the PG ratio hits 9000. Yes, 9000. Then the
>> building loses power. The systems don't have nearly enough RAM to boot,
>> peer, and activate, so the entire cluster has to be wiped and redeployed
>> from scratch. An extreme example, but remember that I don't make stuff up.
>>
>>> Thanks a lot for any clarifications in this matter!
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
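And for completeness, the rule-of-thumb arithmetic behind the per-device PG ratio targets quoted above, sketched as a small Python helper. This is just the familiar "OSDs x target ratio / replicas, rounded to a power of two" formula, not anything lifted from Ceph code, and it assumes one dominant pool; with several busy pools you'd split the ratio budget between them:

    # Turn a target PG ratio into a per-pool pg_num suggestion. This mirrors the
    # usual "OSDs * target ratio / replicas, rounded to a power of two" rule of
    # thumb; it is not taken from Ceph source, and it assumes one dominant pool.
    def suggest_pg_num(num_osds, target_ratio, replicas, round_pow2=True):
        raw = num_osds * target_ratio / replicas
        if not round_pow2:
            return round(raw)
        lower = 1 << (int(raw).bit_length() - 1)   # nearest power of two below
        upper = lower * 2                          # nearest power of two above
        return lower if raw - lower < upper - raw else upper

    # Example: 100 HDD OSDs, replica 3, at the ratios discussed above.
    for ratio in (100, 200, 300):
        print(f"target {ratio:3d} PGs/OSD -> pg_num {suggest_pg_num(100, ratio, 3)}")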