Thank you for your answer! We will use size=4 and min_size=2, which should do the trick.
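On our side that should just be a matter of (the pool name "vms" below is only a placeholder for whatever the RBD pool ends up being called):

    ceph osd pool set vms size 4
    ceph osd pool set vms min_size 2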
For the monitor issue, we have a third datacenter (with higher latency, but that shouldn't be a problem for the monitors), so e.g. with two monitors in each main DC and one in the third, either main DC can fail and the remaining three still form a quorum.

We had also considered the locality issue. Our WAN round-trip latency is 1.5 ms (now) and we should get a dedicated light path in the near future (<0.1 ms), so we hope to get acceptable latency without additional tweaking. Plan B is to make two pools with different weights for the two DCs: VMs in DC 1 get a pool weighted towards DC 1, VMs in DC 2 get a pool weighted towards DC 2.

Thanks,
Wouter

On Sat, Sep 19, 2015 at 9:31 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:
> You will want size=4 and min_size=2 if you want to keep I/O going when a DC
> fails and still ensure some data integrity. Data checksumming (which I think
> is being added) would provide much stronger data integrity checking in a
> two-copy situation, because you could tell which of the two copies is the
> good one instead of needing a third to break the tie.
>
> However, you have yet another problem on your hands. The way monitors
> work makes this tricky. If you have one monitor in one DC and two in the
> other, and the two-monitor DC burns down, the surviving cluster stops
> working too, because more than 50% of the monitors must be available.
> Putting two monitors in each DC just means both stop working if either
> one goes down (you need three to make a quorum). It has been suggested
> that putting the odd monitor in the cloud (or some other location off-site
> from both DCs) could be an option, but latency could cause problems. The
> cloud monitor would complete the quorum with whichever DC survives.
>
> Also remember that there is no data locality awareness in Ceph at the
> moment. This could mean that the primary for a PG is in the other DC, so
> your client has to contact the primary in the other DC, then that OSD
> contacts one OSD in its DC and two in the other, and it has to get
> confirmation that the write is acknowledged before it acks the write to the
> client. For a write you will be between 2 x (LAN latency + WAN latency)
> and 2 x (LAN latency + 2 x WAN latency). Additionally, your reads will be
> between 2 x LAN latency and 2 x WAN latency. Then there is write
> amplification, so make sure you have a lot more WAN bandwidth than you
> think you need.
>
> I think the large majority of us are eagerly waiting for the RBD
> replication feature, or some sort of lag-behind OSD, for situations like
> this.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
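(As a rough illustration of those numbers: taking the 1.5 ms WAN round trip we have today and assuming something like 0.2 ms on the LAN (that LAN figure is only a guess), the formulas above give

    writes: between 2 x (0.2 + 1.5) = 3.4 ms and 2 x (0.2 + 2 x 1.5) = 6.4 ms
    reads:  between 2 x 0.2 = 0.4 ms and 2 x 1.5 = 3.0 ms

so the replication path itself should stay in the single-digit millisecond range even before the dedicated light path.)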
>
> On Sat, Sep 19, 2015 at 12:54 PM, Wouter De Borger <w.debor...@gmail.com> wrote:
> > Ok, so if I understand correctly, for replication level 3 or 4 I would have
> > to use the rule
> >
> > rule replicated_ruleset {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take root
> >         step choose firstn 2 type datacenter
> >         step chooseleaf firstn 2 type host
> >         step emit
> > }
> >
> > The question I have now is: how will it behave when a DC goes down?
> > (Assuming a catastrophic failure, i.e. the thing burns down.)
> >
> > For example, if I set size=3 and min_size=3, then if a DC goes down, CRUSH
> > will only return 2 OSDs per PG, so everything will hang (same for 4/4 and 4/3).
> >
> > If I set size=3 and min_size=2, it could occur that all data of a PG ends up
> > in one DC (degraded mode). If that DC goes down, the PG will hang, and as
> > far as I know degraded PGs will still accept writes, so data loss is
> > possible (same for 4/2).
> >
> > I can't seem to find a way around this. What am I missing?
> >
> > Wouter
> >
> > On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> >>
> >> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.debor...@gmail.com> wrote:
> >> > Hi all,
> >> >
> >> > I have found on the mailing list that it should be possible to have a
> >> > multi-datacenter setup, if latency is low enough.
> >> >
> >> > I would like to set this up so that each datacenter has at least two
> >> > replicas and each PG has a replication level of 3.
> >> >
> >> > In this mail, it is suggested that I should use the following CRUSH rule
> >> > for multi DC:
> >> >
> >> > rule dc {
> >> >         ruleset 0
> >> >         type replicated
> >> >         min_size 1
> >> >         max_size 10
> >> >         step take default
> >> >         step chooseleaf firstn 0 type datacenter
> >> >         step emit
> >> > }
> >> >
> >> > This looks suspicious to me, as it will only generate a list of two
> >> > OSDs (and only one if one DC is down).
> >> >
> >> > I think I should use:
> >> >
> >> > rule replicated_ruleset {
> >> >         ruleset 0
> >> >         type replicated
> >> >         min_size 1
> >> >         max_size 10
> >> >         step take root
> >> >         step choose firstn 2 type datacenter
> >> >         step chooseleaf firstn 2 type host
> >> >         step emit
> >> >         step take root
> >> >         step chooseleaf firstn -4 type host
> >> >         step emit
> >> > }
> >> >
> >> > This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in the
> >> > other, and then a list of further OSDs.
> >> >
> >> > The problem is that this list contains duplicates (e.g. for 8 OSDs per DC):
> >> >
> >> > [13,11,1,8,13,11,16,4,3,7]
> >> > [9,2,13,11,9,15,12,18,3,5]
> >> > [3,5,17,10,3,5,7,13,18,10]
> >> > [7,6,11,14,7,14,3,16,4,11]
> >> > [6,3,15,18,6,3,12,9,16,15]
> >> >
> >> > Will this be a problem?
> >>
> >> For replicated pools, it probably will cause trouble. For EC pools I
> >> think it should work fine, but obviously you're losing all kinds of
> >> redundancy. Nothing in the system will do work to avoid colocating
> >> them if you use a rule like this. Rather than distributing some of the
> >> replicas randomly across DCs, you really just want to split them up
> >> evenly across datacenters (or in some ratio, if one has more space
> >> than the other). Given CRUSH's current abilities that does require
> >> building the replication size into the rule, but such is life.
> >>
> >> > If CRUSH is executed, will it only consider OSDs which are (up, in), or
> >> > all OSDs in the map and then filter them from the list afterwards?
> >>
> >> CRUSH will consider all OSDs, but if it selects any OSDs which are out
> >> then it retries until it gets one that is still marked in.
> >> -Greg
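For completeness, "building the replication size into the rule" the way Greg describes could look roughly like the sketch below; this is untested, and the bucket names dc1 and dc2 are placeholders for however the two datacenters end up being named in the CRUSH map:

rule two_dc_replicated {
        ruleset 1
        type replicated
        min_size 4
        max_size 4
        step take dc1
        step chooseleaf firstn 2 type host
        step emit
        step take dc2
        step chooseleaf firstn 2 type host
        step emit
}

Because the first OSD emitted becomes the PG's primary, a per-pool variant that takes the "local" datacenter first would also give the primary locality that the plan B above is aiming for.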
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com