Hi,

I also wrote a simple script which calculates the data loss probabilities for triple disk failure. Here are some numbers:

OSDs: 10, Pr: 138.89%
OSDs: 20, Pr: 29.24%
OSDs: 30, Pr: 12.32%
OSDs: 40, Pr: 6.75%
OSDs: 50, Pr: 4.25%
OSDs: 100, Pr: 1.03%
OSDs: 200, Pr: 0.25%
OSDs: 500, Pr: 0.04%
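A minimal sketch that reproduces these numbers, assuming the estimate is simply 100 / ((n - 1) * (n - 2)) with 100 PGs per OSD (the actual script may derive it differently; values above 100% just mean the approximation breaks down for very small clusters):

#!/usr/bin/env python
# Rough estimate of how likely a triple disk failure is to destroy a PG.
# Assumes 100 PGs per OSD, replica 3, no host constraint, and the crude
# estimate Pr ~= PGS_PER_OSD / ((n - 1) * (n - 2)).
PGS_PER_OSD = 100

def pr_data_loss(n_osds):
    return float(PGS_PER_OSD) / ((n_osds - 1) * (n_osds - 2))

for n in (10, 20, 30, 40, 50, 100, 200, 500):
    print("OSDs: %d, Pr: %.2f%%" % (n, 100 * pr_data_loss(n)))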
Here I assumed 100 PGs per OSD. There is also the constraint that the 3 failed disks must not all be in one host, since that case does not lead to data loss. For the situation where all disks are evenly distributed between 10 hosts this gives a correction coefficient of 83%, so for 50 OSDs it would be something like 3.53% instead of 4.25%. There is a further constraint for 2 disks in one host and 1 disk in another, but that just adds unneeded complexity; the numbers would not change significantly. And a triple simultaneous failure is itself not very likely to happen, so I believe that starting from about 100 OSDs we can relax somewhat about data loss.

BTW, this presentation has more math:
http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph

Regards,
Vasily.

On Wed, Jun 10, 2015 at 12:38 PM, Dan van der Ster <d...@vanderster.com> wrote:
> OK, I wrote a quick script to simulate triple failures and count how
> many would have caused data loss. The script gets your list of OSDs
> and PGs, then simulates failures and checks if any permutation of that
> failure matches a PG.
>
> Here's an example with 10000 simulations on our production cluster:
>
> # ./simulate-failures.py
> We have 1232 OSDs and 21056 PGs, hence 21056 combinations e.g. like
> this: (945, 910, 399)
> Simulating 10000 failures
> Simulated 1000 triple failures. Data loss incidents = 0
> Data loss incident with failure (676, 451, 931)
> Simulated 2000 triple failures. Data loss incidents = 1
> Simulated 3000 triple failures. Data loss incidents = 1
> Simulated 4000 triple failures. Data loss incidents = 1
> Simulated 5000 triple failures. Data loss incidents = 1
> Simulated 6000 triple failures. Data loss incidents = 1
> Simulated 7000 triple failures. Data loss incidents = 1
> Simulated 8000 triple failures. Data loss incidents = 1
> Data loss incident with failure (1031, 1034, 806)
> Data loss incident with failure (449, 644, 329)
> Simulated 9000 triple failures. Data loss incidents = 3
> Simulated 10000 triple failures. Data loss incidents = 3
>
> End of simulation: Out of 10000 triple failures, 3 caused a data loss
> incident
>
>
> The script is here:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/durability/simulate-failures.py
> Give it a try (on your test clusters!)
>
> Cheers, Dan
>
>
> On Wed, Jun 10, 2015 at 10:47 AM, Jan Schermer <j...@schermer.cz> wrote:
> > Yeah, I know, but I believe it was fixed so that a single copy is
> > sufficient for recovery now (even with min_size=1)? Depends on what you
> > want to achieve...
> >
> > The point is that even if we lost “just” 1% of data, that’s too much
> > (>0%) when talking about customer data, and I know from experience that
> > some volumes are unavailable when I lose 3 OSDs - and I don’t have that
> > many volumes...
> >
> > Jan
> >
> >> On 10 Jun 2015, at 10:40, Dan van der Ster <d...@vanderster.com> wrote:
> >>
> >> I'm not a mathematician, but I'm pretty sure there are 200 choose 3 =
> >> 1.3 million ways you can have 3 disks fail out of 200. nPGs = 16384, so
> >> that many combinations would cause data loss. So I think 1.2% of
> >> triple disk failures would lead to data loss. There might be another
> >> factor of 3! that needs to be applied to nPGs -- I'm currently
> >> thinking about that.
> >> But you're right, if indeed you do ever lose an entire PG, _every_ RBD
> >> device will have random holes in its data, like Swiss cheese.
> >>
> >> BTW, PGs can have stuck IOs without losing all three replicas -- see
> >> min_size.
> >>
> >> Cheers, Dan
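Regarding the 1.2% above: that looks right, and as far as I can tell no extra factor of 3! is needed -- nPGs / (200 choose 3) already compares unordered sets on both sides; a factor of 6 would only appear if the denominator counted ordered OSD triples. A quick sanity check (this assumes all 16384 PGs map to distinct OSD triples, which is why the simulation against the real PG map is the better test):

from math import comb            # Python 3.8+; on older Pythons use 200*199*198//6
print(comb(200, 3))              # 1313400 possible 3-OSD combinations
print(16384 / comb(200, 3))      # ~0.0125, i.e. about 1.2% of triple failures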
> >>
> >> On Wed, Jun 10, 2015 at 10:20 AM, Jan Schermer <j...@schermer.cz> wrote:
> >>> When you increase the number of OSDs, you generally would (and should)
> >>> increase the number of PGs. For us, the sweet spot for ~200 OSDs is
> >>> 16384 PGs.
> >>> An RBD volume that has xxx GiBs of data gets striped across many PGs,
> >>> so the probability that the volume loses at least part of its data is
> >>> very significant.
> >>> Someone correct me if I’m wrong, but I _know_ (from sad experience)
> >>> that with the current CRUSH map, if 3 disks fail in 3 different hosts,
> >>> lots of instances (maybe all of them) have their IO stuck until 3
> >>> copies of data are restored.
> >>>
> >>> I just tested that by hand:
> >>> a 150GB volume will consist of ~150000/4 = 37500 objects.
> >>> When I list their location with “ceph osd map”, every time I get a
> >>> different PG, and a random mix of OSDs that host the PG.
> >>>
> >>> Thus, it is very likely that this volume will be lost when I lose any
> >>> 3 OSDs, as at least one of the PGs will be hosted on all of them. What
> >>> this probability is I don’t know - (I’m not good at statistics, is it
> >>> combinations?) - but generally the data I care most about is stored in
> >>> a multi-terabyte volume, and even if the probability of failure was
> >>> 0.1%, that’s several orders of magnitude too high for me to be
> >>> comfortable.
> >>>
> >>> I’d like nothing more than for someone to tell me I’m wrong :-)
> >>>
> >>> Jan
> >>>
> >>>> On 10 Jun 2015, at 09:55, Dan van der Ster <d...@vanderster.com> wrote:
> >>>>
> >>>> This is a CRUSH misconception. Triple drive failures only cause data
> >>>> loss when they share a PG (e.g. ceph pg dump .. those [x,y,z] triples
> >>>> of OSDs are the only ones that matter). If you have very few OSDs,
> >>>> then it's possibly true that any combination of disks would lead to
> >>>> failure. But as you increase the number of OSDs, the likelihood of a
> >>>> triple sharing a PG decreases (even though the number of 3-way
> >>>> combinations increases).
> >>>>
> >>>> Cheers, Dan
> >>>>
> >>>> On Wed, Jun 10, 2015 at 8:47 AM, Jan Schermer <j...@schermer.cz> wrote:
> >>>>> A hidden danger in the default CRUSH rules is that if you lose 3
> >>>>> drives in 3 different hosts at the same time, you _will_ lose data,
> >>>>> and not just some data but possibly a piece of every RBD volume you
> >>>>> have...
> >>>>> And the probability of that happening is sadly nowhere near zero. We
> >>>>> had drives drop out of the cluster under load, which of course
> >>>>> happens when a drive fails, then another fails, then another fails…
> >>>>> not pretty.
> >>>>>
> >>>>> Jan
> >>>>>
> >>>>>> On 09 Jun 2015, at 18:11, Robert LeBlanc <rob...@leblancnet.us> wrote:
> >>>>>>
> >>>>>> If you are using the default rule set (which I think has min_size 2),
> >>>>>> you can sustain 1-4 disk failures or one host failure.
> >>>>>>
> >>>>>> The reason disk failures vary so wildly is that you can lose all the
> >>>>>> disks in one host.
> >>>>>>
> >>>>>> You can lose up to another 4 disks (in the same host) or 1 host
> >>>>>> without data loss, but I/O will block until Ceph can replicate at
> >>>>>> least one more copy (assuming the min_size 2 stated above).
> >>>>>> ----------------
> >>>>>> Robert LeBlanc
> >>>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>>>
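To tie this back to the original 16-OSD / 4-host cluster (quoted below): with a host-based rule, only 3-disk failures that span 3 different hosts can take out all copies of a PG, and in a cluster that small that is almost half of all possible triples. A quick count, ignoring the actual PG placement:

from math import comb                                # Python 3.8+ for math.comb
osds, hosts, osds_per_host = 16, 4, 4
total_triples = comb(osds, 3)                        # 560
span_3_hosts = comb(hosts, 3) * osds_per_host ** 3   # 4 * 64 = 256
print(span_3_hosts / float(total_triples))           # ~0.46

Whether such a triple actually shares a PG then depends on the PG count, but with so few eligible triples and something like 100 PGs per OSD, most of them probably will -- which is exactly the small-cluster end of the table at the top of this mail.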
> >>>>>> On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar wrote:
> >>>>>>> I have a 4-node cluster, each node with 5 disks (4 OSDs and 1
> >>>>>>> operating system disk, also hosting 3 monitor processes), with
> >>>>>>> default replica 3.
> >>>>>>>
> >>>>>>> Total OSD disks : 16
> >>>>>>> Total Nodes : 4
> >>>>>>>
> >>>>>>> How can I calculate the:
> >>>>>>>
> >>>>>>> Maximum number of disk failures my cluster can handle without any
> >>>>>>> impact on current data and new writes.
> >>>>>>> Maximum number of node failures my cluster can handle without any
> >>>>>>> impact on current data and new writes.
> >>>>>>>
> >>>>>>> Thanks for any help
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com