Ok, so if I understand correctly, for replication level 3 or 4 I would have to use the rule:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

The question I have now is: how will it behave when a DC goes down? (Assuming catastrophic failure, i.e. the thing burns down.)

For example, if I set size to 3 and min_size to 3: if a DC goes down, CRUSH will only return 2 OSDs per PG, so everything will hang (same for 4/4 and 4/3).

If I set size to 3 and min_size to 2, it could happen that all copies of a PG end up in one DC (degraded mode). If that DC then goes down, the PG will hang. As far as I know, degraded PGs still accept writes, so data loss is possible (same for 4/2).

I can't seem to find a way around this. What am I missing?

Wouter

On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.debor...@gmail.com> wrote:
> > Hi all,
> >
> > I have found on the mailing list that it should be possible to have a
> > multi datacenter setup, if latency is low enough.
> >
> > I would like to set this up, so that each datacenter has at least two
> > replicas and each PG has a replication level of 3.
> >
> > In this mail, it is suggested that I should use the following crush map
> > for multi DC:
> >
> > rule dc {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step chooseleaf firstn 0 type datacenter
> >         step emit
> > }
> >
> > This looks suspicious to me, as it will only generate a list of two OSDs
> > (and only one OSD if one DC is down).
> >
> > I think I should use:
> >
> > rule replicated_ruleset {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take root
> >         step choose firstn 2 type datacenter
> >         step chooseleaf firstn 2 type host
> >         step emit
> >         step take root
> >         step chooseleaf firstn -4 type host
> >         step emit
> > }
> >
> > This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in the
> > other, and then a list of further OSDs.
> >
> > The problem is that this list contains duplicates (e.g. for 8 OSDs per
> > DC):
> >
> > [13,11,1,8,13,11,16,4,3,7]
> > [9,2,13,11,9,15,12,18,3,5]
> > [3,5,17,10,3,5,7,13,18,10]
> > [7,6,11,14,7,14,3,16,4,11]
> > [6,3,15,18,6,3,12,9,16,15]
> >
> > Will this be a problem?
>
> For replicated pools, it probably will cause trouble. For EC pools I
> think it should work fine, but obviously you're losing all kinds of
> redundancy. Nothing in the system will do work to avoid colocating
> them if you use a rule like this. Rather than distributing some of the
> replicas randomly across DCs, you really just want to split them up
> evenly across datacenters (or in some ratio, if one has more space
> than the other). Given CRUSH's current abilities that does require
> building the replication size into the rule, but such is life.
>
> > If crush is executed, will it only consider OSDs which are (up, in), or
> > all OSDs in the map, which are then filtered from the list afterwards?
>
> CRUSH will consider all OSDs, but if it selects any OSDs which are out
> then it retries until it gets one that is still marked in.
> -Greg
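For reference, this is roughly what I am planning to try for the 2+2 layout Greg describes (building the replication size into the rule and dropping the second take/emit that produced the duplicates), together with the pool settings. The rule name, ruleset id and pool name below are just placeholders:

rule replicated_2dc {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take root
        # pick both datacenters, then 2 hosts (one OSD each) in every DC
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
}

# point the pool at the new rule and set the replica count (pool name is a placeholder)
ceph osd pool set <pool> crush_ruleset 1
ceph osd pool set <pool> size 4
# min_size is still the trade-off described above (4/4 blocks on a DC failure,
# 4/2 risks serving from a single DC); 2 is shown here only as an example
ceph osd pool set <pool> min_size 2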
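Before injecting a map like this I would check the mappings offline with crushtool, to confirm every PG gets four OSDs and to eyeball that they land two per DC. A rough sketch, assuming the edited map lives in crushmap.txt and the rule got id 1 (file names are placeholders):

# grab and decompile the current map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt, then recompile and test the rule
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 1 --num-rep 4 --show-mappings
# mappings that come back with fewer OSDs than requested show up here
crushtool -i crushmap.new --test --rule 1 --num-rep 4 --show-bad-mappings

# only once the mappings look right
ceph osd setcrushmap -i crushmap.new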