I think what would fit Frédéric's need, while not adding complexity to the tool for new users, is a list of known "gotchas" in PG counts. For example:

* Not using a power-of-2 PG count makes PGs variable in size (for each PG past the last power of 2, you get 2 PGs that are half the size of the others).
* Having fewer than X PGs for a given amount of data on your number of OSDs will cause balance problems.
* Having more than X objects per PG for the PG count selected will cause issues.
* Having more than X PGs per OSD in total (not just per pool) can cause high memory requirements (this is especially important for people setting up multiple RGW zones).
* etc.
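To make a couple of those concrete, here's a rough sketch of what such checks could look like. The function name, the input format, and especially the threshold values (min_pgs_per_osd, max_pgs_per_osd, max_objects_per_pg) are placeholders I made up for illustration, not official recommendations:

```python
def check_pg_gotchas(pools, num_osds,
                     min_pgs_per_osd=30,          # placeholder threshold
                     max_pgs_per_osd=300,         # placeholder threshold
                     max_objects_per_pg=100_000): # placeholder threshold
    """pools: list of dicts with 'name', 'pg_num', 'size' (replicas or k+m), 'objects'."""
    warnings = []
    total_pg_copies = 0

    for p in pools:
        pg_num, size, objects = p['pg_num'], p['size'], p['objects']
        total_pg_copies += pg_num * size

        # Gotcha: non-power-of-2 pg_num -> unevenly sized PGs
        if pg_num & (pg_num - 1):
            warnings.append(f"{p['name']}: pg_num {pg_num} is not a power of 2, "
                            "PGs will be unevenly sized")

        # Gotcha: too many objects per PG
        if objects / pg_num > max_objects_per_pg:
            warnings.append(f"{p['name']}: ~{objects // pg_num} objects per PG, "
                            "consider more PGs for this pool")

    # Gotchas: PG copies per OSD across *all* pools, not just one
    pgs_per_osd = total_pg_copies / num_osds
    if pgs_per_osd > max_pgs_per_osd:
        warnings.append(f"~{pgs_per_osd:.0f} PG copies per OSD: expect high memory usage")
    if pgs_per_osd < min_pgs_per_osd:
        warnings.append(f"only ~{pgs_per_osd:.0f} PG copies per OSD: expect poor balance")

    return warnings
```

Fed the full planned pool layout before deployment, something like this could have flagged the objects-per-PG case Frédéric describes below.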
On Thu, Apr 13, 2017 at 12:58 PM Michael Kidd <linuxk...@redhat.com> wrote:

> Hello Frédéric,
>
> Thank you very much for the input. I would like to ask for some feedback from you, as well as the ceph-users list at large.
>
> The PGCalc tool was created to help steer new Ceph users in the right direction, but it's certainly difficult to account for every possible scenario. I'm struggling to find a way to implement something that would work better for the scenario that you (Frédéric) describe, while still being a useful starting point for the novice / more mainstream use cases. I've also gotten complaints at the other end of the spectrum, that the tool expects the user to know too much already, so accounting for the number of objects is bound to add to this sentiment.
>
> As the Ceph user base expands and the use cases diverge, we are definitely finding more edge cases that are causing pain. I'd love to make something to help prevent these types of issues, but again, I worry about the complexity introduced.
>
> With this, I see a few possible ways forward:
> * Simply re-wording the %data to be % object count -- but this seems more abstract, again leading to more confusion for new users.
> * Increase the complexity of the PG Calc tool, at the risk of further alienating novice/mainstream users.
> * Add a disclaimer about the tool being a basis for decision making, noting that certain edge cases require adjustments to the recommended PG count and/or ceph.conf & sysctl values.
> * Add a disclaimer urging the end user to secure storage consulting if their use case falls into certain categories, or if they are new to Ceph, to ensure the cluster will meet their needs.
>
> Having been on the storage consulting team and knowing the expertise they have, I strongly believe that newcomers to Ceph (or new use cases inside of established customers) should secure consulting before final decisions are made on hardware, let alone before the cluster is deployed. I know it seems a bit self-serving to make this suggestion as I work at Red Hat, but there is a lot on the line when any establishment is storing potentially business-critical data.
>
> I suspect the answer lies in a combination of the above or in something I've not thought of. Please do weigh in, as any and all suggestions are more than welcome.
>
> Thanks,
> Michael J. Kidd
> Principal Software Maintenance Engineer
> Red Hat Ceph Storage
> +1 919-442-8878
>
> On Wed, Apr 12, 2017 at 6:35 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr> wrote:
>
>> Hi,
>>
>> I wanted to share a bad experience we had due to how the PG calculator works.
>>
>> When we set up our production cluster months ago, we had to decide on the number of PGs to give to each pool in the cluster. As you know, the PG calc recommends giving a lot of PGs to pools that are heavy in size, regardless of the number of objects in the pools. How bad...
>>
>> We essentially had 3 pools to set up on 144 OSDs:
>>
>> 1. an EC 5+4 pool for the radosGW (.rgw.buckets) that would hold 80% of all data in the cluster. PG calc recommended 2048 PGs.
>> 2. an EC 5+4 pool for Zimbra's data (emails) that would hold 20% of all data. PG calc recommended 512 PGs.
>> 3. a replicated pool for Zimbra's metadata (null-size objects holding xattrs, used for deduplication) that would hold 0% of all data. PG calc recommended 128 PGs, but we decided on 256.
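To make the "regardless of the number of objects" point concrete: my understanding of the size-based formula behind the calculator is roughly the sketch below, and object count never enters it. Treat this as an approximation of the tool, not its exact behaviour; the target of 100 PGs per OSD is just the commonly used default.

```python
import math

# Approximate size-based recommendation: scale a per-OSD PG target by the
# pool's share of the data, divide by the pool size (replicas, or k+m for EC),
# then round up to the next power of 2.
def size_based_pg_num(num_osds, pct_data, pool_size, target_pgs_per_osd=100):
    raw = target_pgs_per_osd * num_osds * pct_data / pool_size
    return 2 ** max(1, math.ceil(math.log2(raw)))

# Frédéric's pools on 144 OSDs (EC 5+4 -> size 9, replicated -> size 3):
print(size_based_pg_num(144, 0.80, 9))   # 2048, matches the recommendation above
print(size_based_pg_num(144, 0.20, 9))   # 512, matches the recommendation above
print(size_based_pg_num(144, 0.001, 3))  # tiny -- the real tool applies a floor for
                                         # near-0% pools, but object count is ignored
```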
>> With 120M objects in pool #3, as soon as we upgraded to Jewel, we hit the Jewel scrubbing bug (OSDs flapping).
>>
>> Before we could upgrade to a patched Jewel, scrub the whole cluster again, and then increase the number of PGs on this pool, we had to take more than a hundred snapshots (for backup/restoration purposes), with the number of objects still increasing in the pool. Then, when a snapshot was removed, we hit the current Jewel snap trimming bug affecting pools with too many objects for the number of PGs. The only way we could stop the trimming was to stop OSDs, which left PGs degraded and no longer trimming (snap trimming only happens on active+clean PGs).
>>
>> We're now just getting out of this hole, thanks to Nick's post regarding osd_snap_trim_sleep and to RHCS support's expertise.
>>
>> If the PG calc had considered not only the pools' weight but also the number of expected objects in each pool (which we knew by that time), we wouldn't have hit these 2 bugs. We hope this will help improve the ceph.com and RHCS PG calculators.
>>
>> Regards,
>>
>> Frédéric.
>>
>> --
>>
>> Frédéric Nass
>>
>> Sous-direction Infrastructures
>> Direction du Numérique
>> Université de Lorraine
>>
>> Tél : +33 3 72 74 11 35
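For what it's worth, an object-count-aware recommendation of the kind Frédéric is asking for could be layered on top of the size-based one. A minimal sketch, assuming a made-up ceiling of 100k objects per PG (the 120M objects and the 128-PG size-based recommendation come from the thread above; everything else is a placeholder):

```python
import math

def object_aware_pg_num(size_based_pg_num, expected_objects, max_objects_per_pg=100_000):
    # Take whichever is larger: the size-based recommendation, or enough PGs to keep
    # objects-per-PG under the chosen ceiling, rounded up to a power of 2.
    needed = expected_objects / max_objects_per_pg
    object_based = 2 ** max(0, math.ceil(math.log2(max(needed, 1))))
    return max(size_based_pg_num, object_based)

# Pool #3 from the thread: 120M objects, size-based recommendation of 128 PGs.
print(object_aware_pg_num(128, 120_000_000))  # -> 2048 with these placeholder numbers
```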
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com