The CRUSH min and max sizes are part of the "ruleset" facilities that we're
slowly removing because they turned out to have no utility and be overly
complicated to understand. You should probably just set them all to 1 and
10.

The intention behind them was that you could have a single ruleset that
included different rules for sizes 1-3, 4-5, and 6-10 (or whatever, all
numbers made up). Then, as you dynamically changed the (replication) size of
your pool, it would transparently switch between the individual rules based
on their min and max sizes. But that's not really a thing people do, and it
complicates a lot of the interfaces, so it's all going away.
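
To illustrate, the idea was roughly something like the following, all in one
ruleset (the rule names, ruleset number, and placement steps are made up):

rule replicated_small {
        ruleset 1
        type replicated
        min_size 1
        max_size 3
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule replicated_medium {
        ruleset 1
        type replicated
        min_size 4
        max_size 5
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
rule replicated_large {
        ruleset 1
        type replicated
        min_size 6
        max_size 10
        step take default
        step chooseleaf firstn 0 type datacenter
        step emit
}

With that, changing a pool's size from 3 to 4 was supposed to move it from
the "small" rule to the "medium" rule without touching anything else.
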
-Greg
On Tue, Apr 21, 2015 at 9:08 AM Robert LeBlanc <rob...@leblancnet.us> wrote:

> Your logic isn't quite right. From what I understand, this is how it
> works:
>
> step choose firstn 2 type rack       # Choose two racks from the CRUSH map
> (my CRUSH map only has two, so it selects both of them).
> step chooseleaf firstn 2 type host  # From the set chosen previously (the two
> racks), select a leaf (OSD) from 2 hosts of each rack (i.e., of each member
> of the set returned previously).
>
> If you have size 3, it will pick two OSDs from one rack and one from the
> second (remember that the first rack in the placement will sometimes be 'A'
> and sometimes 'B', so the placement won't be totally unbalanced).
>
> Where min_size and max_size come in could be something like this
> (the examples are somewhat exaggerated):
>
> Let's say that you want the lowest possible latency and highest bandwidth
> and are OK with losing data (swap partitions or something). You create a
> pool with size 1 and a rule like this:
>
> rule replicated_swap {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 1
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> Then you have a pool you want to put on some hosts that have
> RAID5-protected OSDs, so you don't need as many replicas because the RAID
> will protect against disk failures:
>
> rule replicated_raid5 {
>         ruleset 1
>         type replicated
>         min_size 2
>         max_size 2
>         step take raid5
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> Then you have a pool that you want "default" protection for, with 3-4 copies:
>
> rule replicated_default {
>         ruleset 2
>         type replicated
>         min_size 3
>         max_size 4
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> Then you have a pool that you absolutely can't lose data on, so you have
> lots of copies and want it spread throughout the data center:
>
> rule replicated_paranoid {
>         ruleset 3
>         type replicated
>         min_size 5
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type rack
>         step emit
> }
>
> You then specify the rule to use for each pool. Again, the min and max
> size are a selector for the rule. If the actual pool size is outside of the
> min and max, then the rule should not apply (I don't know if CRUSH actually
> enforces this or if it is just a reminder to the human of what sizes the
> rule was intentionally written for).
>
> On Tue, Apr 21, 2015 at 8:36 AM, Colin Corr <co...@pc-doctor.com> wrote:
>
>>
>>
>> On 04/20/2015 04:18 PM, Robert LeBlanc wrote:
>> > You usually won't end up with more than the "size" number of replicas,
>> > even in a failure situation. Although technically more than "size" number
>> > of OSDs may have the data (if the OSD comes back in service, the journal
>> > may be used to quickly get the OSD back up to speed), these would not be
>> > active.
>> >
>> > For us, using size 4 and min_size 2 means that we can lose an entire
>> > rack (2 copies) without blocking I/O. Our configuration prevents four
>> > copies in one rack. If we lose a rack and then an OSD in the surviving
>> > rack, write I/O to those placement groups will block until the objects
>> > have been replicated elsewhere in the rack, but it would not be more
>> > than 2 copies.
>> >
>> > I hope I'm making sense and that my jabbering is useful.
>>
>> Yes, it is helpful, thank you. My clarity level has been upgraded from
>> mud to stained glass.
>>
>> If I am following the logic of your rule correctly:
>>
>> 1. If we have less than 2 replicas per rack, run this step:
>> step choose firstn 2 type rack
>> 2. If we have less than 2 replicas on our hosts in this rack, run this
>> step:
>> step chooseleaf firstn 2 type host
>>
>> I still don't understand where exactly max_size comes into play, unless
>> you have some elaborate chain of rules, like mixing platter and ssd drives
>> in the same pool. The documented example for this scenario is the only one
>> I have found that utilizes the max_size in a meaningful way.
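>>
>> For reference, the shape of that documented example, as I understand it,
>> is roughly this (the bucket and rule names are made up, and this is only a
>> sketch, not the exact rule from the docs): a single ruleset where the
>> pool's size picks between an ssd rule and a platter rule.
>>
>> rule ssd_small {
>>         ruleset 5
>>         type replicated
>>         min_size 1
>>         max_size 2
>>         step take ssd
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }
>> rule platter_large {
>>         ruleset 5
>>         type replicated
>>         min_size 3
>>         max_size 10
>>         step take platter
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }
>>
>> With something like that, the pool's size alone decides whether it lands
>> on the ssd or the platter root.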
>>
>> Anyway, thanks for your help in translating from CRUSH to English.