> On 7 December 2016 at 21:43, "Will.Boege" <will.bo...@target.com> wrote:
> 
> 
> Thanks for the explanation.  I guess this case you outlined explains why the 
> Ceph developers chose to make this a ‘safe’ default.
> 
> 2 osds are transiently down and the third fails hard. The PGs on the 3rd osd 
> with no more replicas are marked unfound. You bring up 1 and 2 and these PGs 
> will remain unfound because they were stale; at that point you can either 
> revert or delete those PGs. Am I understanding that correctly?
> 

That is about correct, indeed. You can say that you accept the stale data and 
lose any changes that happened in the time that they were down.
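
For anyone hitting this, the usual way to inspect and then resolve unfound 
objects is per PG. A sketch of the commands (the PG id 2.5 below is just a 
placeholder; check the health output for the real ids):

```shell
# Show cluster problems, including PGs that have unfound objects.
ceph health detail

# Inspect a PG (placeholder id) to see which objects are unfound and why.
ceph pg 2.5 query

# Either roll the unfound objects back to their previous version...
ceph pg 2.5 mark_unfound_lost revert

# ...or give up on them entirely (deletes the unfound objects).
ceph pg 2.5 mark_unfound_lost delete
```

Both `revert` and `delete` are irreversible, so query the PG first.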

> I still think there is a cost/benefit conversation to be had around this 
> setting. A 2-OSD failure will be far more probable than the 'sequence of 
> events' type failure you outlined above. There is an availability cost to 
> several blocked-I/O events per year, weighed against a data loss event that 
> might be a once-every-three-years type thing. 
> 

That is a decision everybody has to make for themselves. With replication set 
to 2 I have seen many data loss situations, which is why I started this thread 
in the first place.

min_size is just an additional protection mechanism against data loss.
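
For reference, both knobs are per-pool settings; a minimal sketch, with 
`mypool` as a placeholder pool name:

```shell
# Keep three copies of every object in the pool.
ceph osd pool set mypool size 3

# Block client I/O as soon as fewer than two copies are available.
ceph osd pool set mypool min_size 2

# Verify the current settings.
ceph osd pool get mypool size
ceph osd pool get mypool min_size
```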

Wido
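
P.S. For those who want the failure sequence spelled out, here is a toy 
simulation of the min_size rule (plain Python; the Replica class and the write 
helper are invented for illustration, none of this is Ceph code):

```python
# Toy model: a write is applied only if at least min_size replicas are up.
class Replica:
    def __init__(self, name):
        self.name = name
        self.version = 0   # how many writes this copy has seen
        self.up = True

def write(replicas, min_size):
    """Apply a write only to the replicas that are up, and only if
    enough of them are up to satisfy min_size."""
    up = [r for r in replicas if r.up]
    if len(up) < min_size:
        return False       # I/O blocked; all copies stay consistent
    for r in up:
        r.version += 1
    return True

osds = [Replica("osd.1"), Replica("osd.2"), Replica("osd.3")]

# osd.1 and osd.2 go down; only osd.3 is left.
osds[0].up = osds[1].up = False

# With min_size=1 the write succeeds on osd.3 alone...
assert write(osds, min_size=1)

# ...so when osd.3 dies and osd.2 comes back, osd.2 is stale:
osds[2].up = False
osds[1].up = True
assert osds[1].version < osds[2].version   # osd.2 missed osd.3's write

# With min_size=2 the same write would have been blocked, keeping
# the surviving copies identical:
fresh = [Replica("osd.1"), Replica("osd.2"), Replica("osd.3")]
fresh[0].up = fresh[1].up = False
assert not write(fresh, min_size=2)
assert fresh[1].version == fresh[2].version
```

The last assertion is the whole point: with min_size=2, the copy you bring 
back later is guaranteed to match the one that broke.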

> I guess it’s just where you want to put that needle on the spectrum of 
> availability vs integrity.
> 
> On 12/7/16, 2:10 PM, "Wido den Hollander" <w...@42on.com> wrote:
> 
>     
>     > On 7 December 2016 at 21:04, "Will.Boege" 
> <will.bo...@target.com> wrote:
>     > 
>     > 
>     > Hi Wido,
>     > 
>     > Just curious how blocking IO to the final replica provides protection 
> from data loss?  I’ve never really understood why this is a Ceph best 
> practice.  In my head all 3 replicas would be on devices that have roughly 
> the same odds of physically failing or getting logically corrupted in any 
> given minute.  Not sure how blocking IO prevents this.
>     > 
>     
>     Say disk #1 fails and you have #2 and #3 left. Now #2 fails, leaving only 
> #3.
>     
>     By blocking I/O you know that #2 and #3 still have the same data. Although 
> #2 failed, it may be that only the host went down while the disk itself is 
> just fine. Maybe the SATA cable broke; you never know.
>     
>     If disk #3 now fails you can still continue operation by bringing #2 
> back: it has the same data on disk as #3 had before #3 failed, because you 
> didn't allow any I/O on #3 after #2 went down.
>     
>     If you had accepted writes on #3 while #1 and #2 were gone, you would 
> have invalid/old data on #2 by the time it comes back.
>     
>     Writes were made on #3, but that disk really broke down. You managed to 
> get #2 back, but it doesn't have the changes that #3 had.
>     
>     The result is corrupted data.
>     
>     Does this make sense?
>     
>     Wido
>     
>     > On 12/7/16, 9:11 AM, "ceph-users on behalf of LOIC DEVULDER" 
> <ceph-users-boun...@lists.ceph.com on behalf of loic.devul...@mpsa.com> wrote:
>     > 
>     >     > -----Original message-----
>     >     > From: Wido den Hollander [mailto:w...@42on.com]
>     >     > Sent: Wednesday, 7 December 2016 16:01
>     >     > To: ceph-us...@ceph.com; LOIC DEVULDER - U329683 
> <loic.devul...@mpsa.com>
>     >     > Subject: RE: [ceph-users] 2x replication: A BIG warning
>     >     > 
>     >     > 
>     >     > > On 7 December 2016 at 15:54, LOIC DEVULDER
>     >     > <loic.devul...@mpsa.com> wrote:
>     >     > >
>     >     > >
>     >     > > Hi Wido,
>     >     > >
>     >     > > > As a Ceph consultant I get numerous calls throughout the year 
> to
>     >     > > > help people with getting their broken Ceph clusters back 
> online.
>     >     > > >
>     >     > > > The causes of downtime vary vastly, but one of the biggest 
> causes is
>     >     > > > that people use replication 2x. size = 2, min_size = 1.
>     >     > >
>     >     > > We are building a Ceph cluster for our OpenStack and for data 
>     >     > integrity reasons we have chosen to set size=3. But we want to 
>     >     > continue to access data if 2 of our 3 OSD servers are dead, so we 
>     >     > decided to set min_size=1.
>     >     > >
>     >     > > Is it a (very) bad idea?
>     >     > >
>     >     > 
>     >     > I would say so. Yes, downtime on your cloud is annoying, but data 
>     >     > loss is far worse.
>     >     > 
>     >     > I would always run with min_size = 2 and manually switch to 
> min_size = 1
>     >     > if the situation really requires it at that moment.
>     >     > 
>     >     > Losing two disks at the same time is something that doesn't happen 
>     >     > often, but if it does you don't want to modify any data on the only 
>     >     > copy you still have left.
>     >     > 
>     >     > Setting min_size to 1 should imho be a manual action when size = 3 
>     >     > and you lose two copies. In that case YOU decide at that moment if 
>     >     > it is the right course of action.
>     >     > 
>     >     > Wido
>     >     
>     >     Thanks for your quick response!
>     >     
>     >     That makes sense, I will try to convince my colleagues :-)
>     >     
>     >     Loic
>     >     _______________________________________________
>     >     ceph-users mailing list
>     >     ceph-users@lists.ceph.com
>     >     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >     
>     > 
>     >
>     
>     
> 
>