Hi,

I also wrote a simple script which calculates the data loss probabilities for triple disk failure. Here are some numbers:

OSDs: 10, Pr: 138.89%
OSDs: 20, Pr: 29.24%
OSDs: 30, Pr: 12.32%
OSDs: 40, Pr: 6.75%
OSDs: 50, Pr: 4.25%
OSDs: 100, Pr: 1.03%
OSDs: 200, Pr: 0.25%
OSDs: 500, Pr: 0.04%
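A minimal sketch that reproduces these numbers, assuming the estimate is simply 100 / ((n - 1) * (n - 2)) with 100 PGs per OSD (the actual script may derive it differently; values above 100% just mean the approximation breaks down for very small clusters):

#!/usr/bin/env python
# Rough estimate of how likely a triple disk failure is to destroy a PG.
# Assumes 100 PGs per OSD, replica 3, no host constraint, and the crude
# estimate Pr ~= PGS_PER_OSD / ((n - 1) * (n - 2)).
PGS_PER_OSD = 100

def pr_data_loss(n_osds):
    return float(PGS_PER_OSD) / ((n_osds - 1) * (n_osds - 2))

for n in (10, 20, 30, 40, 50, 100, 200, 500):
    print("OSDs: %d, Pr: %.2f%%" % (n, 100 * pr_data_loss(n)))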
Here I assumed 100 PGs per OSD. There is also the constraint that the 3 failed disks must not all be in one host, since that case does not lead to data loss. For the situation where all disks are evenly distributed between 10 hosts this gives a correction coefficient of 83%, so for 50 OSDs it would be something like 3.53% instead of 4.25%. There is a further constraint for 2 disks in one host and 1 disk in another, but that just adds unneeded complexity; the numbers would not change significantly. And a triple simultaneous failure is itself not very likely to happen, so I believe that starting from about 100 OSDs we can relax somewhat about data loss.

BTW, this presentation has more math:
http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph

Regards,
Vasily.

On Wed, Jun 10, 2015 at 12:38 PM, Dan van der Ster <d...@vanderster.com> wrote:
> OK, I wrote a quick script to simulate triple failures and count how
> many would have caused data loss. The script gets your list of OSDs
> and PGs, then simulates failures and checks if any permutation of that
> failure matches a PG.
>
> Here's an example with 10000 simulations on our production cluster:
>
> # ./simulate-failures.py
> We have 1232 OSDs and 21056 PGs, hence 21056 combinations e.g. like
> this: (945, 910, 399)
> Simulating 10000 failures
> Simulated 1000 triple failures. Data loss incidents = 0
> Data loss incident with failure (676, 451, 931)
> Simulated 2000 triple failures. Data loss incidents = 1
> Simulated 3000 triple failures. Data loss incidents = 1
> Simulated 4000 triple failures. Data loss incidents = 1
> Simulated 5000 triple failures. Data loss incidents = 1
> Simulated 6000 triple failures. Data loss incidents = 1
> Simulated 7000 triple failures. Data loss incidents = 1
> Simulated 8000 triple failures. Data loss incidents = 1
> Data loss incident with failure (1031, 1034, 806)
> Data loss incident with failure (449, 644, 329)
> Simulated 9000 triple failures. Data loss incidents = 3
> Simulated 10000 triple failures. Data loss incidents = 3
>
> End of simulation: Out of 10000 triple failures, 3 caused a data loss
> incident
>
>
> The script is here:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/durability/simulate-failures.py
> Give it a try (on your test clusters!)
>
> Cheers, Dan
>
>
> On Wed, Jun 10, 2015 at 10:47 AM, Jan Schermer <j...@schermer.cz> wrote:
> > Yeah, I know, but I believe it was fixed so that a single copy is
> > sufficient for recovery now (even with min_size=1)? Depends on what you
> > want to achieve...
> >
> > The point is that even if we lost “just” 1% of data, that’s too much
> > (>0%) when talking about customer data, and I know from experience that
> > some volumes are unavailable when I lose 3 OSDs - and I don’t have that
> > many volumes...
> >
> > Jan
> >
> >> On 10 Jun 2015, at 10:40, Dan van der Ster <d...@vanderster.com> wrote:
> >>
> >> I'm not a mathematician, but I'm pretty sure there are 200 choose 3 =
> >> 1.3 million ways you can have 3 disks fail out of 200. nPGs = 16384, so
> >> that many combinations would cause data loss. So I think 1.2% of
> >> triple disk failures would lead to data loss. There might be another
> >> factor of 3! that needs to be applied to nPGs -- I'm currently
> >> thinking about that.
> >> But you're right, if indeed you do ever lose an entire PG, _every_ RBD
> >> device will have random holes in its data, like Swiss cheese.
> >>
> >> BTW, PGs can have stuck IOs without losing all three replicas -- see
> >> min_size.
> >>
> >> Cheers, Dan
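Regarding the 1.2% above: that looks right, and as far as I can tell no extra factor of 3! is needed -- nPGs / (200 choose 3) already compares unordered sets on both sides; a factor of 6 would only appear if the denominator counted ordered OSD triples. A quick sanity check (this assumes all 16384 PGs map to distinct OSD triples, which is why the simulation against the real PG map is the better test):

from math import comb            # Python 3.8+; on older Pythons use 200*199*198//6
print(comb(200, 3))              # 1313400 possible 3-OSD combinations
print(16384 / comb(200, 3))      # ~0.0125, i.e. about 1.2% of triple failures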
> >>
> >> On Wed, Jun 10, 2015 at 10:20 AM, Jan Schermer <j...@schermer.cz> wrote:
> >>> When you increase the number of OSDs, you generally would (and should)
> >>> increase the number of PGs. For us, the sweet spot for ~200 OSDs is
> >>> 16384 PGs.
> >>> An RBD volume that has xxx GiBs of data gets striped across many PGs,
> >>> so the probability that the volume loses at least part of its data is
> >>> very significant.
> >>> Someone correct me if I’m wrong, but I _know_ (from sad experience)
> >>> that with the current CRUSH map, if 3 disks fail in 3 different hosts,
> >>> lots of instances (maybe all of them) have their IO stuck until 3
> >>> copies of data are restored.
> >>>
> >>> I just tested that by hand:
> >>> a 150GB volume will consist of ~150000/4 = 37500 objects.
> >>> When I list their location with “ceph osd map”, every time I get a
> >>> different PG, and a random mix of OSDs that host the PG.
> >>>
> >>> Thus, it is very likely that this volume will be lost when I lose any
> >>> 3 OSDs, as at least one of the PGs will be hosted on all of them. What
> >>> this probability is I don’t know - (I’m not good at statistics, is it
> >>> combinations?) - but generally the data I care most about is stored in
> >>> a multi-terabyte volume, and even if the probability of failure was
> >>> 0.1%, that’s several orders of magnitude too high for me to be
> >>> comfortable.
> >>>
> >>> I’d like nothing more than for someone to tell me I’m wrong :-)
> >>>
> >>> Jan
> >>>
> >>>> On 10 Jun 2015, at 09:55, Dan van der Ster <d...@vanderster.com> wrote:
> >>>>
> >>>> This is a CRUSH misconception. Triple drive failures only cause data
> >>>> loss when they share a PG (e.g. ceph pg dump .. those [x,y,z] triples
> >>>> of OSDs are the only ones that matter). If you have very few OSDs,
> >>>> then it's possibly true that any combination of disks would lead to
> >>>> failure. But as you increase the number of OSDs, the likelihood of a
> >>>> triple sharing a PG decreases (even though the number of 3-way
> >>>> combinations increases).
> >>>>
> >>>> Cheers, Dan
> >>>>
> >>>> On Wed, Jun 10, 2015 at 8:47 AM, Jan Schermer <j...@schermer.cz> wrote:
> >>>>> A hidden danger in the default CRUSH rules is that if you lose 3
> >>>>> drives in 3 different hosts at the same time, you _will_ lose data,
> >>>>> and not just some data but possibly a piece of every RBD volume you
> >>>>> have...
> >>>>> And the probability of that happening is sadly nowhere near zero. We
> >>>>> had drives drop out of the cluster under load, which of course
> >>>>> happens when a drive fails, then another fails, then another fails…
> >>>>> not pretty.
> >>>>>
> >>>>> Jan
> >>>>>
> >>>>>> On 09 Jun 2015, at 18:11, Robert LeBlanc <rob...@leblancnet.us> wrote:
> >>>>>>
> >>>>>> If you are using the default rule set (which I think has min_size 2),
> >>>>>> you can sustain 1-4 disk failures or one host failure.
> >>>>>>
> >>>>>> The reason disk failures vary so wildly is that you can lose all the
> >>>>>> disks in one host.
> >>>>>>
> >>>>>> You can lose up to another 4 disks (in the same host) or 1 host
> >>>>>> without data loss, but I/O will block until Ceph can replicate at
> >>>>>> least one more copy (assuming the min_size 2 stated above).
> >>>>>> ----------------
> >>>>>> Robert LeBlanc
> >>>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>>>
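To tie this back to the original 16-OSD / 4-host cluster (quoted below): with a host-based rule, only 3-disk failures that span 3 different hosts can take out all copies of a PG, and in a cluster that small that is almost half of all possible triples. A quick count, ignoring the actual PG placement:

from math import comb                                # Python 3.8+ for math.comb
osds, hosts, osds_per_host = 16, 4, 4
total_triples = comb(osds, 3)                        # 560
span_3_hosts = comb(hosts, 3) * osds_per_host ** 3   # 4 * 64 = 256
print(span_3_hosts / float(total_triples))           # ~0.46

Whether such a triple actually shares a PG then depends on the PG count, but with so few eligible triples and something like 100 PGs per OSD, most of them probably will -- which is exactly the small-cluster end of the table at the top of this mail.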
> >>>>>> On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar wrote:
> >>>>>>> I have a 4-node cluster, each node with 5 disks (4 OSDs and 1
> >>>>>>> operating system disk, also hosting 3 monitor processes), with
> >>>>>>> default replica 3.
> >>>>>>>
> >>>>>>> Total OSD disks : 16
> >>>>>>> Total Nodes : 4
> >>>>>>>
> >>>>>>> How can I calculate the:
> >>>>>>>
> >>>>>>> Maximum number of disk failures my cluster can handle without any
> >>>>>>> impact on current data and new writes.
> >>>>>>> Maximum number of node failures my cluster can handle without any
> >>>>>>> impact on current data and new writes.
> >>>>>>>
> >>>>>>> Thanks for any help
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com