Ok, so if I understand correctly, for replication level 3 or 4 I would have to use the rule:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

The question I have now is: how will it behave when a DC goes down? (Assuming catastrophic failure, i.e. the thing burns down.)

For example, if I set size to 3 and min_size to 3: if a DC goes down, CRUSH will only return 2 OSDs per PG, so everything will hang (same for 4/4 and 4/3).

If I set size to 3 and min_size to 2, it could happen that all copies of a PG end up in one DC (degraded mode). If that DC then goes down, the PG will hang. As far as I know, degraded PGs still accept writes, so data loss is possible (same for 4/2).

I can't seem to find a way around this. What am I missing?

Wouter

On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.debor...@gmail.com> wrote:
> > Hi all,
> >
> > I have found on the mailing list that it should be possible to have a
> > multi datacenter setup, if latency is low enough.
> >
> > I would like to set this up, so that each datacenter has at least two
> > replicas and each PG has a replication level of 3.
> >
> > In this mail, it is suggested that I should use the following crush map
> > for multi DC:
> >
> > rule dc {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step chooseleaf firstn 0 type datacenter
> >         step emit
> > }
> >
> > This looks suspicious to me, as it will only generate a list of two OSDs
> > (and only one OSD if one DC is down).
> >
> > I think I should use:
> >
> > rule replicated_ruleset {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take root
> >         step choose firstn 2 type datacenter
> >         step chooseleaf firstn 2 type host
> >         step emit
> >         step take root
> >         step chooseleaf firstn -4 type host
> >         step emit
> > }
> >
> > This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in the
> > other, and then a list of further OSDs.
> >
> > The problem is that this list contains duplicates (e.g. for 8 OSDs per
> > DC):
> >
> > [13,11,1,8,13,11,16,4,3,7]
> > [9,2,13,11,9,15,12,18,3,5]
> > [3,5,17,10,3,5,7,13,18,10]
> > [7,6,11,14,7,14,3,16,4,11]
> > [6,3,15,18,6,3,12,9,16,15]
> >
> > Will this be a problem?
>
> For replicated pools, it probably will cause trouble. For EC pools I
> think it should work fine, but obviously you're losing all kinds of
> redundancy. Nothing in the system will do work to avoid colocating
> them if you use a rule like this. Rather than distributing some of the
> replicas randomly across DCs, you really just want to split them up
> evenly across datacenters (or in some ratio, if one has more space
> than the other). Given CRUSH's current abilities that does require
> building the replication size into the rule, but such is life.
>
> > If crush is executed, will it only consider OSDs which are (up, in), or
> > all OSDs in the map, which are then filtered from the list afterwards?
>
> CRUSH will consider all OSDs, but if it selects any OSDs which are out
> then it retries until it gets one that is still marked in.
> -Greg
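For reference, this is roughly what I am planning to try for the 2+2 layout Greg describes (building the replication size into the rule and dropping the second take/emit that produced the duplicates), together with the pool settings. The rule name, ruleset id and pool name below are just placeholders:

rule replicated_2dc {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take root
        # pick both datacenters, then 2 hosts (one OSD each) in every DC
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

# point the pool at the new rule and set the replica count (pool name is a placeholder)
ceph osd pool set <pool> crush_ruleset 1
ceph osd pool set <pool> size 4
# min_size is still the trade-off described above (4/4 blocks on a DC failure,
# 4/2 risks serving from a single DC); 2 is shown here only as an example
ceph osd pool set <pool> min_size 2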
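Before injecting a map like this I would check the mappings offline with crushtool, to confirm every PG gets four OSDs and to eyeball that they land two per DC. A rough sketch, assuming the edited map lives in crushmap.txt and the rule got id 1 (file names are placeholders):

# grab and decompile the current map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt, then recompile and test the rule
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 1 --num-rep 4 --show-mappings
# mappings that come back with fewer OSDs than requested show up here
crushtool -i crushmap.new --test --rule 1 --num-rep 4 --show-bad-mappings

# only once the mappings look right
ceph osd setcrushmap -i crushmap.new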