Thank you for your answer! We will use size=4 and min_size=2, which should
do the trick.
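
For reference, that should just be two commands (a sketch, assuming the
pool is named "rbd"):

    ceph osd pool set rbd size 4
    ceph osd pool set rbd min_size 2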

For the monitor issue, we have a third datacenter (with higher latency, but
that shouldn't be a problem for the monitors).

We had also considered the locality issue. Our WAN round-trip latency is
currently 1.5 ms, and we should get a dedicated light path (<0.1 ms) in the
near future, so we hope to get acceptable latency without additional
tweaking.
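
As a rough sanity check against your formula (treating LAN latency as
negligible and reading "WAN latency" as our 1.5 ms round trip):

    writes now:          2 x 1.5 ms  to  2 x (2 x 1.5 ms)  =  3 to 6 ms
    writes (light path): 2 x 0.1 ms  to  2 x (2 x 0.1 ms)  =  0.2 to 0.4 ms

which looks acceptable for us.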
Plan B is to make two pools, with different CRUSH weights for the two DCs:
VMs in DC 1 will use the pool weighted toward DC 1, and VMs in DC 2 the
pool weighted toward DC 2, roughly as sketched below.
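
The pool for DC 1 would use a rule along these lines (just a sketch,
assuming datacenter buckets named dc1 and dc2 and size=4; the DC 2 pool
would be the mirror image):

    rule dc1_local {
        ruleset 1
        type replicated
        min_size 2
        max_size 4
        step take dc1
        step chooseleaf firstn 2 type host
        step emit
        step take dc2
        step chooseleaf firstn -2 type host
        step emit
    }

Since CRUSH returns the dc1 OSDs first, the primary ends up in DC 1.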

Thanks,
Wouter

On Sat, Sep 19, 2015 at 9:31 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:

> You will want size=4 min_size=2 if you want to keep I/O going when a DC
> fails and still ensure some data integrity. Data checksumming (which I
> think is being added) would provide much stronger integrity checking in
> a two-copy situation, since you could tell which of the two copies is
> the good one instead of needing a third to break the tie.
>
> However, you have yet another problem on your hands. The way monitors
> work makes this tricky. If you have one monitor in one DC and two in
> the other, and the DC with two monitors burns down, the surviving
> cluster stops working too, because a strict majority of the monitors
> must be available. Putting two monitors in each DC is no better: if
> either DC goes down, both stop working (with four monitors you need
> three for quorum). It has been suggested that putting the odd monitor
> in the cloud (or some other location off-site from both DCs) could be
> an option, but latency could cause problems. The cloud monitor would
> complete the quorum with whichever DC survives.
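>
> A quick worked example (hostnames hypothetical): with mon.a and mon.b
> in DC1, mon.c and mon.d in DC2, and mon.e off-site, you have five
> monitors and quorum needs three. If DC1 burns down, mon.c, mon.d and
> mon.e still make three, so the survivors keep quorum, and likewise if
> DC2 goes.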
>
> Also remember that there is no data locality awareness in Ceph at the
> moment. This means the primary for a PG may be in the other DC. Your
> client then has to contact the primary in the other DC; that OSD
> contacts one OSD in its DC and two in the other, waits for all of them
> to acknowledge the write, and only then acks the write to the client.
> For a write you will be between 2 x ( LAN latency + WAN latency ) and
> 2 x ( LAN latency + 2 x WAN latency ). Your reads will be between
> 2 x LAN latency and 2 x WAN latency. On top of that there is write
> amplification, so make sure you have a lot more WAN bandwidth than you
> think you need.
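>
> A back-of-the-envelope example (my numbers, assuming size=4 split 2+2
> across the DCs): two of the four copies of every write always land in
> the remote DC, so 100 MB/s of sustained client writes means at least
> 200 MB/s of WAN traffic, and more whenever the primary is remote from
> the client.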
>
> I think the large majority of us are eagerly waiting for the RBD
> replication feature, or some sort of lag-behind OSD, for situations
> like this.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Sat, Sep 19, 2015 at 12:54 PM, Wouter De Borger <w.debor...@gmail.com> wrote:
> > Ok, so if I understand correctly, for replication level 3 or 4 I would
> > have to use the rule:
> >
> > rule replicated_ruleset {
> >     ruleset 0
> >     type replicated
> >     min_size 1
> >     max_size 10
> >     step take root
> >     step choose firstn 2 type datacenter
> >     step chooseleaf firstn 2 type host
> >     step emit
> > }
> >
> > The question I have now is: how will it behave when a DC goes down?
> > (Assuming catastrophic failure, the thing burns down)
> >
> > For example, if I set replication to 3 and min_rep to 3: if a DC goes
> > down, CRUSH will only return 2 OSDs, so everything will hang (same for
> > 4/4 and 4/3).
> >
> > If I set replication to 3 and min_rep to 2, it could occur that all
> > data of a PG ends up in one DC (degraded mode). If this DC goes down,
> > the PG will hang. As far as I know, degraded PGs will still accept
> > writes, so data loss is possible (same for 4/2).
> >
> >
> >
> > I can't seem to find a way around this. What am I missing?
> >
> >
> > Wouter
> >
> >
> >
> >
> > On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> >>
> >> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger <w.debor...@gmail.com> wrote:
> >> > Hi all,
> >> >
> >> > I have found on the mailing list that it should be possible to have
> >> > a multi-datacenter setup, if latency is low enough.
> >> >
> >> > I would like to set this up, so that each datacenter has at least two
> >> > replicas and each PG has a replication level of 3.
> >> >
> >> > In this mail, it is suggested that I should use the following CRUSH
> >> > rule for multi DC:
> >> >
> >> > rule dc {
> >> >     ruleset 0
> >> >     type replicated
> >> >     min_size 1
> >> >     max_size 10
> >> >     step take default
> >> >     step chooseleaf firstn 0 type datacenter
> >> >     step emit
> >> > }
> >> >
> >> > This looks suspicious to me, as it will only generate a list of two
> >> > OSDs (and only one if a DC is down).
> >> >
> >> > I think I should use:
> >> >
> >> > rule replicated_ruleset {
> >> >     ruleset 0
> >> >     type replicated
> >> >     min_size 1
> >> >     max_size 10
> >> >     step take root
> >> >     step choose firstn 2 type datacenter
> >> >     step chooseleaf firstn 2 type host
> >> >     step emit
> >> >     step take root
> >> >     step chooseleaf firstn -4 type host
> >> >     step emit
> >> > }
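> >> >
> >> > (As I understand the CRUSH docs, firstn -4 means "pool size minus 4"
> >> > additional leaves, so the second step only emits extra OSDs for
> >> > pools with more than 4 replicas.)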
> >> >
> >> > This correctly generates a list with 2 OSDs in one DC, then 2 OSDs
> >> > in the other, and then a list of further OSDs.
> >> >
> >> > The problem is that this list contains duplicates (e.g. for 8 OSDs
> >> > per DC):
> >> >
> >> > [13,11,1,8,13,11,16,4,3,7]
> >> > [9,2,13,11,9,15,12,18,3,5]
> >> > [3,5,17,10,3,5,7,13,18,10]
> >> > [7,6,11,14,7,14,3,16,4,11]
> >> > [6,3,15,18,6,3,12,9,16,15]
> >> >
> >> > Will this be a problem?
> >>
> >> For replicated pools, it probably will cause trouble. For EC pools I
> >> think it should work fine, but obviously you're losing all kinds of
> >> redundancy. Nothing in the system will do work to avoid colocating
> >> them if you use a rule like this. Rather than distributing some of the
> >> replicas randomly across DCs, you really just want to split them up
> >> evenly across datacenters (or in some ratio, if one has more space
> >> than the other). Given CRUSH's current abilities that does require
> >> building the replication size into the rule, but such is life.
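> >>
> >> For instance, a fixed 2+1 split with the size baked in might look
> >> like this (a sketch only; the dc1/dc2 bucket names are hypothetical):
> >>
> >> rule split_2_1 {
> >>     ruleset 1
> >>     type replicated
> >>     min_size 3
> >>     max_size 3
> >>     step take dc1
> >>     step chooseleaf firstn 2 type host
> >>     step emit
> >>     step take dc2
> >>     step chooseleaf firstn 1 type host
> >>     step emit
> >> }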
> >>
> >>
> >> > If CRUSH is executed, will it only consider OSDs which are (up, in),
> >> > or all OSDs in the map and then filter them from the list afterwards?
> >>
> >> CRUSH will consider all OSDs, but if it selects any OSDs which are out
> >> then it retries until it gets one that is still marked in.
> >> -Greg
> >
> >
> >