I would make sure that your CRUSH rules are designed for such a
failure. We currently have two racks and can suffer a one rack loss
without blocking I/O. Here is what we do:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
}
All pools are size=4 and min_size=2

This puts only two copies in each rack so that at most half of the
objects can be taken down by a rack loss. We also configure Ceph with
"mon_osd_down_out_subtree_limit = host" so that it won't automatically
mark a whole rack out (not that it would do much in our current
two-rack config).
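As a sanity check, the two-step placement above can be modeled with a
toy sketch (illustrative Python, not the real CRUSH hashing; the rack
and host names are made up):

```python
import random

# Toy model of "step choose firstn 2 type rack" followed by
# "step chooseleaf firstn 2 type host": pick 2 racks, then 2 distinct
# hosts in each, giving the 4 copies of a size=4 pool.
cluster = {
    "rack1": ["r1h1", "r1h2", "r1h3"],
    "rack2": ["r2h1", "r2h2", "r2h3"],
}

def place_pg(cluster):
    racks = random.sample(sorted(cluster), 2)
    return [(r, h) for r in racks for h in random.sample(cluster[r], 2)]

pg = place_pg(cluster)
# Each rack holds exactly 2 of the 4 copies, so losing a whole rack
# leaves 2 copies -- still >= min_size=2, and I/O is not blocked.
survivors = [c for c in pg if c[0] != "rack1"]
assert len(survivors) >= 2
```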

Our network failure domain (dual Ethernet switches) spans two racks,
so our next failure domain is what we call a PUD, i.e. two racks. The
3-4 rack configuration is similar to the above, with the choose type
changed to pud. Once we reach our fifth rack of storage, our config
changes to:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type pud
        step emit
}
All pools are size=3 and min_size=2

In this configuration, only one copy is kept per PUD and we can lose
two racks in a PUD without blocking I/O in our cluster.
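The arithmetic behind that claim is simple enough to spell out (the
size and min_size values come from the config above):

```python
# size=3 with one copy per PUD: an entire PUD failing (both of its
# racks at once) takes out at most one copy.
size, min_size, copies_per_pud = 3, 2, 1
remaining = size - copies_per_pud  # copies left after losing one PUD
assert remaining >= min_size       # still enough copies to serve I/O
```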

Under the default CRUSH rules, it is possible for two copies of an
object to land in one rack. What does `ceph osd crush rule dump` show?



----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Jun 24, 2015 at 7:44 AM, Lionel Bouton <
lionel-subscript...@bouton.name> wrote:

>  On 06/24/15 14:44, Romero Junior wrote:
>
>  Hi,
>
>
>
> We are setting up a test environment using Ceph as the main storage
> solution for our QEMU-KVM virtualization platform, and everything works
> fine except for the following:
>
>
>
> When I simulate a failure by powering off the switches on one of our
> three racks, my virtual machines get into a weird state; the
> illustration might help you fully understand what is going on:
> http://i.imgur.com/clBApzK.jpg
>
>
>
> The PGs are distributed based on racks; these are not the default CRUSH
> rules.
>
>
> What does ceph -s report while you are in this state?
>
> 16000 PGs might be a problem: when a rack goes down, if your crushmap
> rules distribute PGs based on rack, with size = 2 approximately 2/3 of
> your PGs should be in a degraded state. This means that ~10666 PGs will
> have to copy data to get back to an active+clean state. Your two other
> racks will then be really busy. You can probably tune the recovery
> process to avoid too much interference with your normal VM I/O.
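The ~10666 figure can be reproduced with a quick combinatorial check
(illustrative Python; it assumes copies are spread uniformly over rack
pairs, which real CRUSH weighting only approximates):

```python
from math import comb

racks, size, total_pgs = 3, 2, 16000
# A PG survives a rack loss intact only if both copies landed in the
# two surviving racks: C(racks-1, size) of the C(racks, size) placements.
degraded_fraction = 1 - comb(racks - 1, size) / comb(racks, size)
degraded_pgs = int(total_pgs * degraded_fraction)
print(degraded_fraction, degraded_pgs)  # 2/3 of the PGs, ~10666
```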
> You didn't say where the monitors are placed (and there are only 2 in
> your illustration, which means any one of them becoming unreachable
> will bring down your cluster).
>
> That said, I'm not sure that having a failure domain at the rack level
> when you only have 3 racks is a good idea. What you end up with when a
> switch fails is a reconfiguration of two thirds of your cluster, which
> is not desirable in any case. If possible, either distribute the
> hardware across more racks (4 racks: only 1/2 of your data will be
> affected; 5 racks: only 2/5; ...) or make the switches redundant (each
> OSD server connected to 2 switches, ...).
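The 1/2 and 2/5 figures follow the same pattern; a short sketch of how
the affected fraction shrinks as racks are added (again assuming
uniform placement over rack pairs):

```python
from math import comb

size = 2  # two copies per PG, at most one copy per rack
fractions = {}
for racks in (3, 4, 5):
    # Fraction of PGs that have a copy in any one given rack.
    fractions[racks] = 1 - comb(racks - 1, size) / comb(racks, size)
    print(racks, round(fractions[racks], 3))  # 3 -> 0.667, 4 -> 0.5, 5 -> 0.4
```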
>
> Note that with 33 servers per rack, 3 OSDs per server and 3 racks, you
> have approximately 300 disks. With so many disks, size = 2 is probably
> too low to get a negligible probability of losing data (even if the
> failure case is 2 amongst 100 and not 300). With only ~20 disks we
> already came close to 2 simultaneous failures once (admittedly it was a
> combination of hardware and human error in the early days of our
> cluster). We currently have one failed disk and another showing signs
> of hardware problems (erratic performance) within a span of a few
> weeks.
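The risk with size = 2 across many disks can be made concrete with a
rough back-of-the-envelope model (the AFR and recovery-window numbers
below are assumptions for illustration, not measurements from either
cluster):

```python
from math import exp

afr = 0.03           # assumed annual failure rate per disk
recovery_hours = 8.0 # assumed time to re-replicate a failed disk
peers = 100          # disks that may hold the surviving copy (per the post)

# Probability that at least one of those peers also fails before
# re-replication completes (Poisson approximation); with size=2 that
# second failure can mean lost data.
p_window = afr * recovery_hours / (24 * 365)
p_second_failure = 1 - exp(-peers * p_window)
print(p_second_failure)  # on the order of a few in a thousand per incident
```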
>
> Best regards,
>
> Lionel
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
