I would make sure that your CRUSH rules are designed for such a failure. We currently have two racks and can suffer a one-rack loss without blocking I/O. Here is what we do:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
}

All pools are size=4 and min_size=2. This puts exactly two copies in each rack, so a rack loss can only take down half of the copies of any object. We also configure Ceph with "mon_osd_downout_subtree_limit = host" so that it won't automatically mark a whole rack out (not that it would do a whole lot in our current two-rack config).
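For reference, a minimal sketch of the matching pool and ceph.conf settings (the pool name "rbd" is only an example, substitute your own pools):

    # set replica counts per pool
    ceph osd pool set rbd size 4
    ceph osd pool set rbd min_size 2

    # ceph.conf, in the [mon] (or [global]) section
    mon osd downout subtree limit = host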
Our network failure domain (dual Ethernet switches) spans two racks, so our next failure domain is what we call a PUD, i.e. 2 racks. The 3-4 rack configuration is similar to the above, with the choose step changed to type pud. Once we get to our 5th rack of storage, our config changes to:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type pud
        step emit
}

All pools are size=3 and min_size=2. In this configuration only one copy is kept per PUD, so we can lose both racks of a PUD without blocking I/O in our cluster.

Under the default CRUSH rules, it is possible to end up with two copies of an object in the same rack. What does `ceph osd crush rule dump` show?
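If you want to dig further than the rule dump, a rough sketch of the usual crushtool workflow for inspecting and test-mapping the rules (file names here are just examples):

    ceph osd crush rule dump                  # JSON view of the compiled rules
    ceph osd getcrushmap -o crush.bin         # fetch the binary CRUSH map
    crushtool -d crush.bin -o crush.txt       # decompile to text for review
    # after reviewing/editing crush.txt:
    crushtool -c crush.txt -o crush.new
    crushtool -i crush.new --test --rule 0 --num-rep 2 --show-utilization
    ceph osd setcrushmap -i crush.new         # only if you actually changed it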
----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Wed, Jun 24, 2015 at 7:44 AM, Lionel Bouton <lionel-subscript...@bouton.name> wrote:
> On 06/24/15 14:44, Romero Junior wrote:
> > Hi,
> >
> > We are setting up a test environment using Ceph as the main storage
> > solution for my QEMU-KVM virtualization platform, and everything works
> > fine except for the following:
> >
> > When I simulate a failure by powering off the switches on one of our
> > three racks, my virtual machines get into a weird state; the
> > illustration might help you fully understand what is going on:
> > http://i.imgur.com/clBApzK.jpg
> >
> > The PGs are distributed based on racks; these are not the default
> > crush rules.
>
> What is ceph -s telling you while you are in this state?
>
> 16000 pgs might be a problem: when your rack goes down, if your crushmap
> rules distribute pgs based on rack, with size = 2 approximately 2/3 of
> your pgs should be in a degraded state. This means that ~10666 pgs will
> have to copy data to get back to an active+clean state. Your 2 other
> racks will then be really busy. You can probably tune the recovery
> processes to avoid too much interference with your normal VM I/Os.
>
> You didn't tell us where the monitors are placed (and there are only 2
> on your illustration, which means any one of them being unreachable will
> bring down your cluster).
>
> That said, I'm not sure that having a failure domain at the rack level
> when you only have 3 racks is a good idea. What you end up with when a
> switch fails is a reconfiguration of two thirds of your cluster, which
> is not desirable in any case. If possible, either distribute the
> hardware in more racks (4 racks: only 1/2 of your data will be affected,
> 5 racks: only 2/5, ...) or make the switches redundant (each server with
> OSDs connected to 2 switches, ...).
>
> Note that with 33 servers per rack, 3 OSDs per server and 3 racks you
> have approximately 300 disks. With so many disks, size=2 is probably too
> low to get a negligible probability of losing data (even if the failure
> case is 2 amongst 100 and not 300). With only ~20 disks we already came
> close to 2 simultaneous failures once (admittedly it was the combination
> of hardware and human error in the earlier days of our cluster). We
> currently have one failed disk and one showing signs (erratic
> performance) of hardware problems, within a span of a few weeks.
>
> Best regards,
>
> Lionel
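On Lionel's point about tuning the recovery processes: the usual throttles are along these lines (values are only illustrative, adjust for your hardware and how much recovery speed you are willing to trade away):

    # ceph.conf, [osd] section
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

    # or injected at runtime without a restart
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'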
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com