> Op 7 december 2016 om 10:06 schreef Dan van der Ster <d...@vanderster.com>:
> 
> 
> Hi Wido,
> 
> Thanks for the warning. We have one pool as you described (size 2,
> min_size 1), simply because 3 replicas would be too expensive and
> erasure coding didn't meet our performance requirements. We are well
> aware of the risks, but of course this is a balancing act between risk
> and cost.
> 

Well, that is good. You are aware of the risk.

> Anyway, I'm curious if you ever use
> osd_find_best_info_ignore_history_les in order to recover incomplete
> PGs (while accepting the possibility of data loss). I've used this on
> two colleagues' clusters over the past few months and as far as they
> could tell there was no detectable data loss in either case.
> 

No, not really. Most of my cases involved a genuine drive failure and then a 
second failure during recovery, with XFS broken underneath.

> So I run with size = 2 because if something bad happens I'll
> re-activate the PG with osd_find_best_info_ignore_history_les, then
> re-scrub both within Ceph and via our external application.
> 
> Any thoughts on that?
> 

No real thoughts, but it will mainly be useful in a flapping case where an OSD 
might have outdated data; that is still better than nothing.
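
For what it's worth, a rough sketch of how that option is typically applied: 
enable it only on the OSD(s) in the acting set of the incomplete PG, let the PG 
peer, and revert it immediately afterwards. In the sketch below, osd.12 is a 
placeholder id, and depending on your Ceph release the option may only take 
effect after an OSD restart, so verify against your version first.

```shell
# Sketch only -- osd.12 is a placeholder for the primary OSD of the
# incomplete PG. This option tells peering to ignore the last-epoch-started
# history check, which can bring the PG back at the risk of losing writes.
ceph tell osd.12 injectargs '--osd_find_best_info_ignore_history_les=1'

# ...wait for the PG to go active, then scrub and verify the data...

# Revert right away; leaving this enabled invites silent data loss.
ceph tell osd.12 injectargs '--osd_find_best_info_ignore_history_les=0'
```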

Wido

> Cheers, Dan
> 
> P.S. we're going to retry erasure coding for this cluster in 2017,
> because clearly 4+2 or similar would be much safer than size 2,
> provided we can get the needed performance.
> 
> 
> 
> On Wed, Dec 7, 2016 at 9:08 AM, Wido den Hollander <w...@42on.com> wrote:
> > Hi,
> >
> > As a Ceph consultant I get numerous calls throughout the year to help 
> > people get their broken Ceph clusters back online.
> >
> > The causes of downtime vary widely, but one of the biggest is that people 
> > run with 2x replication: size = 2, min_size = 1.
> >
> > In 2016 the number of cases I saw where data was lost due to these 
> > settings grew exponentially.
> >
> > Usually a disk fails, recovery kicks in, and while recovery is happening a 
> > second disk fails, causing PGs to become incomplete.
> >
> > There have been too many times where I had to run xfs_repair on broken 
> > disks and use ceph-objectstore-tool to export/import PGs.
> >
> > I really don't like these cases, mainly because they can be prevented 
> > easily by using size = 3 and min_size = 2 for all pools.
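
For reference, switching an existing pool to those values is one command per 
setting; "mypool" below is a placeholder name, and raising size will trigger 
backfill of the extra replica, so expect recovery traffic:

```shell
# Sketch -- "mypool" is a placeholder pool name.
ceph osd pool set mypool size 3      # triggers backfill of the third replica
ceph osd pool set mypool min_size 2  # I/O blocks below two healthy copies

# Verify the settings took effect
ceph osd pool get mypool size
ceph osd pool get mypool min_size
```
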
> >
> > With size = 2 you enter the danger zone as soon as a single disk/daemon 
> > fails. With size = 3 you always have two additional copies left, keeping 
> > your data safe(r).
> >
> > If you are running CephFS, at least consider running the 'metadata' pool 
> > with size = 3 to keep the MDS happy.
> >
> > Please, let this be a big warning to everybody who is running with size = 
> > 2. The downtime and problems caused by missing objects/replicas are usually 
> > severe, and it takes days to recover from them. Very often data is also 
> > lost and/or corrupted, which causes even more problems.
> >
> > I can't stress this enough. Running with size = 2 in production is a 
> > SERIOUS hazard and should not be done imho.
> >
> > To anyone out there running with size = 2, please reconsider this!
> >
> > Thanks,
> >
> > Wido
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
