Your logic isn't quite right. From what I understand, this is how it
works:

step choose firstn 2 type rack      # Choose two racks from the CRUSH map
                                    # (my CRUSH map only has two, so both
                                    # are selected).
step chooseleaf firstn 2 type host  # In each rack chosen by the previous
                                    # step, choose two hosts and select one
                                    # leaf (OSD) under each of them.

If the pool has size 3, only the first three of those four OSDs are used:
two from one rack and one from the other (remember that the first rack in
the placement will sometimes be 'A' and sometimes 'B', so the placement
won't be totally unbalanced).
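
The two steps and the size 3 case can be sketched as a toy model in Python.
This is not how CRUSH actually hashes or weights anything: the topology,
the host and OSD names, and the SHA-256 ranking are all assumptions made up
for illustration. It only shows the shape of the result (two OSDs from the
first rack picked, one from the second).

```python
import hashlib

# Hypothetical two-rack topology: rack -> host -> list of OSD ids.
TOPOLOGY = {
    "rackA": {"hostA1": [0, 1], "hostA2": [2, 3], "hostA3": [4, 5]},
    "rackB": {"hostB1": [6, 7], "hostB2": [8, 9], "hostB3": [10, 11]},
}


def ranked(items, key):
    """Order items by a deterministic hash of (key, item).

    A stand-in for CRUSH's real pseudo-random selection, which also takes
    weights into account.
    """
    return sorted(
        items,
        key=lambda item: hashlib.sha256(f"{key}/{item}".encode()).hexdigest(),
    )


def place(pg_id, pool_size):
    """Toy model of the two steps for one placement group."""
    osds = []
    # step choose firstn 2 type rack
    for rack in ranked(TOPOLOGY, pg_id)[:2]:
        # step chooseleaf firstn 2 type host: two hosts, one OSD under each
        for host in ranked(TOPOLOGY[rack], pg_id)[:2]:
            osds.append(ranked(TOPOLOGY[rack][host], pg_id)[0])
    # The rule yields four OSDs; the pool only uses the first pool_size.
    return osds[:pool_size]


def rack_of(osd):
    """Return the name of the rack that contains the given OSD id."""
    return next(
        rack
        for rack, hosts in TOPOLOGY.items()
        for host_osds in hosts.values()
        if osd in host_osds
    )


if __name__ == "__main__":
    for pg in range(4):
        chosen = place(pg, 3)
        print(pg, chosen, [rack_of(o) for o in chosen])
```

Which rack comes first depends on the placement group id, which is the
point of the remark about 'A' and 'B' above.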

Where min_size and max_size come in could be something like this (this
is somewhat exaggerated):

Let's say that you want the lowest possible latency and the highest
bandwidth, and are OK with losing data (swap partitions or something). You
create a pool with size 1 and a rule like this:

rule replicated_swap {
        ruleset 0
        type replicated
        min_size 1
        max_size 1
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
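
A note on "firstn 0", which all of these rules use: a count of 0 means "as
many as the pool's size", a positive count means exactly that many, and a
negative count means the pool's size minus that many. A minimal sketch of
that arithmetic (the function name resolve_firstn is made up here, it is
not a Ceph API):

```python
def resolve_firstn(n, pool_size):
    """Number of items a 'firstn n' step asks for, given the pool size."""
    if n > 0:
        return n  # an explicit count is used as-is
    # n == 0 -> pool_size; n < 0 -> pool_size minus |n|
    return pool_size + n


if __name__ == "__main__":
    # 'chooseleaf firstn 0 type host' on a size 1 pool asks for one host
    print(resolve_firstn(0, 1))
```

So the same "firstn 0" step picks one host for a size 1 pool and three
hosts for a size 3 pool.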

Then you have a pool you want to put on some hosts that have RAID5-protected
OSDs, so you don't need as many replicas because RAID will protect against
disk failures:

rule replicated_raid5 {
        ruleset 1
        type replicated
        min_size 2
        max_size 2
        step take raid5
        step chooseleaf firstn 0 type host
        step emit
}

Then you have a pool for which you want "default" protection of 3-4 copies:

rule replicated_default {
        ruleset 2
        type replicated
        min_size 3
        max_size 4
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

Then you have a pool that you absolutely can't lose data on, so you have
lots of copies and want it spread throughout the data center:

rule replicated_paranoid {
        ruleset 3
        type replicated
        min_size 5
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}

You then specify the rule to use for each pool. Again, the min and max sizes
are a selector for the rule. If the actual pool size is outside the min
and max, then the rule should not run (I don't know whether it actually
does this or whether it is just a reminder to the human of the sizes the
rule was intentionally written for).
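
The "selector" reading can be sketched like this. It models only the
intent described above, using the size ranges from the four example rules;
whether Ceph itself enforces the range is not something I can confirm, and
rule_applies is a made-up name, not a Ceph function.

```python
# ruleset number -> (min_size, max_size), copied from the example rules
SIZE_RANGES = {0: (1, 1), 1: (2, 2), 2: (3, 4), 3: (5, 10)}


def rule_applies(ruleset, pool_size):
    """True if pool_size falls inside the rule's [min_size, max_size]."""
    low, high = SIZE_RANGES[ruleset]
    return low <= pool_size <= high


if __name__ == "__main__":
    for ruleset in sorted(SIZE_RANGES):
        sizes = [s for s in range(1, 11) if rule_applies(ruleset, s)]
        print(ruleset, sizes)
```

Under this reading, a size 5 pool pointed at ruleset 2 would fall outside
the range that rule was written for.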

On Tue, Apr 21, 2015 at 8:36 AM, Colin Corr <co...@pc-doctor.com> wrote:

>
>
> On 04/20/2015 04:18 PM, Robert LeBlanc wrote:
> > You usually won't end up with more than the "size" number of replicas,
> even in a failure situation. Although technically more than "size" number
> of OSDs may have the data (if the OSD comes back in service, the journal
> may be used to quickly get the OSD back up to speed), these would not be
> active.
> >
> > For us using size 4 and min size 2 is so that we can lose an entire rack
> (2 copies) but not block I/O. Our configuration prevents four copies in one
> rack. If we lose a rack and then an OSD in the surviving rack, write I/O to
> those placement groups will block until the objects have been
> replicated elsewhere in the rack, but it would not be more than 2 copies.
> >
> > I hope I'm making sense and that my jabbering is useful.
>
> Yes, it is helpful, thank you. My clarity level has been upgraded from mud
> to stained glass.
>
> If I am following the logic of your rule correctly:
>
> 1. If we have less than 2 replicas per rack, run this step:
> step choose firstn 2 type rack
> 2. If we have less than 2 replicas on our hosts in this rack, run this
> step:
> step chooseleaf firstn 2 type host
>
> I still don't understand where exactly max_size comes into play, unless
> you have some elaborate chain of rules, like mixing platter and ssd drives
> in the same pool. The documented example for this scenario is the only one
> I have found that utilizes the max_size in a meaningful way.
>
> Anyway, thanks for your help in translating from CRUSH to English.
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>