I think what would fit Frédéric's need, while not adding complexity to the tool for new users, is a list of known "gotchas" in PG counts. For example:

* Not using a power-of-2 PG count makes PGs variable in size (for each PG past the last power of 2, you get 2 PGs that are half the size of the others).
* Having fewer than X PGs for a given amount of data on your number of OSDs will cause balance problems.
* Having more than X objects per PG for the PG count selected will cause issues.
* Having more than X PGs per OSD in total (not just per pool) can cause high memory requirements (this is especially important for people setting up multiple RGW zones).
* etc.
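To make a couple of those concrete, here's a rough sketch of what such checks could look like. The function name, the input format, and especially the threshold values (min_pgs_per_osd, max_pgs_per_osd, max_objects_per_pg) are placeholders I made up for illustration, not official recommendations:

```python
def check_pg_gotchas(pools, num_osds,
                     min_pgs_per_osd=30,          # placeholder threshold
                     max_pgs_per_osd=300,         # placeholder threshold
                     max_objects_per_pg=100_000): # placeholder threshold
    """pools: list of dicts with 'name', 'pg_num', 'size' (replicas or k+m), 'objects'."""
    warnings = []
    total_pg_copies = 0

    for p in pools:
        pg_num, size, objects = p['pg_num'], p['size'], p['objects']
        total_pg_copies += pg_num * size

        # Gotcha: non-power-of-2 pg_num -> unevenly sized PGs
        if pg_num & (pg_num - 1):
            warnings.append(f"{p['name']}: pg_num {pg_num} is not a power of 2, "
                            "PGs will be unevenly sized")

        # Gotcha: too many objects per PG
        if objects / pg_num > max_objects_per_pg:
            warnings.append(f"{p['name']}: ~{objects // pg_num} objects per PG, "
                            "consider more PGs for this pool")

    # Gotchas: PG copies per OSD across *all* pools, not just one
    pgs_per_osd = total_pg_copies / num_osds
    if pgs_per_osd > max_pgs_per_osd:
        warnings.append(f"~{pgs_per_osd:.0f} PG copies per OSD: expect high memory usage")
    if pgs_per_osd < min_pgs_per_osd:
        warnings.append(f"only ~{pgs_per_osd:.0f} PG copies per OSD: expect poor balance")

    return warnings
```

Fed the full planned pool layout before deployment, something like this could have flagged the objects-per-PG case Frédéric describes below.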
On Thu, Apr 13, 2017 at 12:58 PM Michael Kidd <linuxk...@redhat.com> wrote:

> Hello Frédéric,
>
> Thank you very much for the input. I would like to ask for some feedback from you, as well as the ceph-users list at large.
>
> The PGCalc tool was created to help steer new Ceph users in the right direction, but it's certainly difficult to account for every possible scenario. I'm struggling to find a way to implement something that would work better for the scenario that you (Frédéric) describe, while still being a useful starting point for the novice / more mainstream use cases. I've also gotten complaints at the other end of the spectrum, that the tool expects the user to know too much already, so accounting for the number of objects is bound to add to this sentiment.
>
> As the Ceph user base expands and the use cases diverge, we are definitely finding more edge cases that are causing pain. I'd love to make something to help prevent these types of issues, but again, I worry about the complexity introduced.
>
> With this, I see a few possible ways forward:
> * Simply re-wording the %data to be % object count -- but this seems more abstract, again leading to more confusion for new users.
> * Increase the complexity of the PG Calc tool, at the risk of further alienating novice/mainstream users.
> * Add a disclaimer about the tool being a basis for decision making, noting that certain edge cases require adjustments to the recommended PG count and/or ceph.conf & sysctl values.
> * Add a disclaimer urging the end user to secure storage consulting if their use case falls into certain categories, or if they are new to Ceph, to ensure the cluster will meet their needs.
>
> Having been on the storage consulting team and knowing the expertise they have, I strongly believe that newcomers to Ceph (or new use cases inside of established customers) should secure consulting before final decisions are made on hardware, let alone before the cluster is deployed. I know it seems a bit self-serving to make this suggestion as I work at Red Hat, but there is a lot on the line when any establishment is storing potentially business-critical data.
>
> I suspect the answer lies in a combination of the above or in something I've not thought of. Please do weigh in, as any and all suggestions are more than welcome.
>
> Thanks,
> Michael J. Kidd
> Principal Software Maintenance Engineer
> Red Hat Ceph Storage
> +1 919-442-8878
>
> On Wed, Apr 12, 2017 at 6:35 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr> wrote:
>
>> Hi,
>>
>> I wanted to share a bad experience we had due to how the PG calculator works.
>>
>> When we set up our production cluster months ago, we had to decide on the number of PGs to give to each pool in the cluster. As you know, the PG calc recommends giving a lot of PGs to pools that are heavy in size, regardless of the number of objects in the pools. How bad...
>>
>> We essentially had 3 pools to set up on 144 OSDs:
>>
>> 1. an EC 5+4 pool for the radosGW (.rgw.buckets) that would hold 80% of all data in the cluster. PG calc recommended 2048 PGs.
>> 2. an EC 5+4 pool for Zimbra's data (emails) that would hold 20% of all data. PG calc recommended 512 PGs.
>> 3. a replicated pool for Zimbra's metadata (null-size objects holding xattrs, used for deduplication) that would hold 0% of all data. PG calc recommended 128 PGs, but we decided on 256.
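To make the "regardless of the number of objects" point concrete: my understanding of the size-based formula behind the calculator is roughly the sketch below, and object count never enters it. Treat this as an approximation of the tool, not its exact behaviour; the target of 100 PGs per OSD is just the commonly used default.

```python
import math

# Approximate size-based recommendation: scale a per-OSD PG target by the
# pool's share of the data, divide by the pool size (replicas, or k+m for EC),
# then round up to the next power of 2.
def size_based_pg_num(num_osds, pct_data, pool_size, target_pgs_per_osd=100):
    raw = target_pgs_per_osd * num_osds * pct_data / pool_size
    return 2 ** max(1, math.ceil(math.log2(raw)))

# Frédéric's pools on 144 OSDs (EC 5+4 -> size 9, replicated -> size 3):
print(size_based_pg_num(144, 0.80, 9))   # 2048, matches the recommendation above
print(size_based_pg_num(144, 0.20, 9))   # 512, matches the recommendation above
print(size_based_pg_num(144, 0.001, 3))  # tiny -- the real tool applies a floor for
                                         # near-0% pools, but object count is ignored
```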
>> With 120M objects in pool #3, as soon as we upgraded to Jewel, we hit the Jewel scrubbing bug (OSDs flapping).
>>
>> Before we could upgrade to a patched Jewel, scrub the whole cluster again, and then increase the number of PGs on this pool, we had to take more than a hundred snapshots (for backup/restoration purposes), with the number of objects still increasing in the pool. Then, when a snapshot was removed, we hit the current Jewel snap trimming bug affecting pools with too many objects for the number of PGs. The only way we could stop the trimming was to stop OSDs, which left PGs degraded and no longer trimming (snap trimming only happens on active+clean PGs).
>>
>> We're now just getting out of this hole, thanks to Nick's post regarding osd_snap_trim_sleep and to RHCS support's expertise.
>>
>> If the PG calc had considered not only the pools' weight but also the number of expected objects in each pool (which we knew by that time), we wouldn't have hit these 2 bugs. We hope this will help improve the ceph.com and RHCS PG calculators.
>>
>> Regards,
>>
>> Frédéric.
>>
>> --
>>
>> Frédéric Nass
>>
>> Sous-direction Infrastructures
>> Direction du Numérique
>> Université de Lorraine
>>
>> Tél : +33 3 72 74 11 35
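For what it's worth, an object-count-aware recommendation of the kind Frédéric is asking for could be layered on top of the size-based one. A minimal sketch, assuming a made-up ceiling of 100k objects per PG (the 120M objects and the 128-PG size-based recommendation come from the thread above; everything else is a placeholder):

```python
import math

def object_aware_pg_num(size_based_pg_num, expected_objects, max_objects_per_pg=100_000):
    # Take whichever is larger: the size-based recommendation, or enough PGs to keep
    # objects-per-PG under the chosen ceiling, rounded up to a power of 2.
    needed = expected_objects / max_objects_per_pg
    object_based = 2 ** max(0, math.ceil(math.log2(max(needed, 1))))
    return max(size_based_pg_num, object_based)

# Pool #3 from the thread: 120M objects, size-based recommendation of 128 PGs.
print(object_aware_pg_num(128, 120_000_000))  # -> 2048 with these placeholder numbers
```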
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com