I think the answer is very simple: data loss. You are setting yourself up for 
data loss. Having only +1 redundancy is a design flaw and you will be fully 
responsible for losing data on such a set-up. If this is not a problem, then 
that's an option. If this will get you fired, it's not.

> There is a big difference between traditional RAID1 and Ceph. Namely, with
> Ceph, there are nodes where OSDs are running, and these nodes need
> maintenance. You want to be able to perform maintenance even if you have
> one broken OSD, that's why the recommendation is to have three copies with
> Ceph. There is no such "maintenance" consideration with traditional RAID1,
> so two copies are OK there.

Yes, this is exactly the point. The keyword is "redundancy under degraded 
conditions". It's not just that you want to be able to maintain stuff while an 
OSD is down, you want to be able to maintain stuff without risking data loss 
every single time. A simple example is OS updates that require reboots. Each of 
these operations opens a window of opportunity for something else to fail.

The other thing is admin errors. Redundancy under degraded conditions allows 
admins to make one or more extra mistakes during maintenance. I learned this 
the hard way when upgrading our MON data disks. We have 3 MONs and I needed to 
migrate each MON store to new storage. Of course I managed to install the new 
disks in one node and wipe the MON store on another MON. Two hours of downtime. 
I will upgrade to 5 MONs as soon as possible.

More serious examples are Ceph upgrades. There are plenty of instances where 
these went wrong for one reason or another and people needed to redeploy 
entire OSD hosts. That is a very long window of opportunity for data loss 
during a complete rebuild.

And never trust your boss when he says "we will replace everything long before 
MTBF". This is BS as soon as budgets get cut.

I think, however, another really important aspect is data security. In a 
small cluster you might get away with thinking in typical RAID terms. However, 
a scale-out cluster is defined by the property that multiple simultaneous disk 
failures will be observed regularly, "simultaneous" meaning failures within the 
window of opportunity opened by degraded objects being present.
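To put a rough number on that window of opportunity, here is a minimal 
back-of-the-envelope sketch. All figures (annual failure rate, rebuild time, 
disk count) are illustrative assumptions, not measurements from my cluster:

```python
# Sketch: probability that at least one more disk fails while the cluster
# is still rebuilding redundancy. All numbers are assumptions.

afr = 0.02          # assumed annual failure rate per disk (2%)
rebuild_hours = 8   # assumed time spent with degraded objects present
disks = 200         # other disks in the cluster that could fail meanwhile

# Per-disk probability of failing inside the rebuild window
p_disk = afr * rebuild_hours / (365 * 24)

# Probability that at least one of the other disks fails in that window
p_any = 1 - (1 - p_disk) ** disks

print(f"per-disk in window: {p_disk:.2e}, any of {disks} disks: {p_any:.2e}")
```

Even with these modest assumptions, a concurrent failure during rebuild is a 
per-mille event per incident, and it scales with cluster size and rebuild time.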

The limit for observing this is not as high as one might think. Pushing prices 
down means pushing hardware to its physical limits, and quality control will 
not catch everything. We got a batch of 8 disks that do not seem to be great. 
One has already failed (half a year into production) and others regularly show 
up with slow ops. It's not bad enough to get them replaced, so I have to deal 
with it. They are all in one host, so I can sleep, but it is quite likely that 
a few of them go while the cluster is rebuilding redundancy.

For scale-out storage the distributed RAID of Ceph comes to the rescue; 
without it, it would be impossible to run a scale-out system. If you do the 
statistics on the probability of losing sufficiently many OSDs that share a 
PG, you will find that this probability goes down exponentially with the 
number of extra copies/shards, where +1 just leaves you at ordinary RAID 
level, meaning it's dangerous.
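A toy calculation shows the exponential drop. The per-OSD failure probability 
p during the repair window is an assumed value chosen only to make the scaling 
visible, not a measured one:

```python
# Sketch: loss probability for one PG versus replica count, assuming
# independent OSD failures with probability p before repair completes.

p = 1e-3  # assumed chance that a given OSD holding the PG dies before repair

for size in (2, 3, 4):
    # Losing the PG requires all `size` copies to fail concurrently,
    # so the probability scales roughly like p**size.
    print(f"size={size}: ~{p ** size:.0e}")
```

Each extra copy buys roughly a factor 1/p in safety, which is why size=2 sits 
at ordinary RAID level while size=3 and size=4 are in a different league.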

Taking all of this together, maintainability and probability of data loss, I 
regret that I didn't go for EC 8+3 (3 extra shards) instead of 8+2. The same 
holds for replication: 3 copies is the lowest number that is safe, and 4 is a 
lot better.
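The same kind of estimate can be sketched for erasure coding with a binomial 
tail: a PG with profile k+m is lost once more than m of its k+m shards fail 
before repair. The per-shard probability p below is again an assumed 
illustrative number:

```python
from math import comb

# Sketch: compare EC 8+2 and 8+3 under an assumed per-shard failure
# probability p during the repair window (independent failures).

def loss_prob(k, m, p):
    """P(more than m of the k+m shards fail) = data loss for one PG."""
    n = k + m
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(m + 1, n + 1))

p = 1e-3
print(f"8+2: {loss_prob(8, 2, p):.1e}")
print(f"8+3: {loss_prob(8, 3, p):.1e}")
```

With these assumptions the extra shard of 8+3 cuts the per-PG loss probability 
by a factor of a few hundred, at the cost of one more disk's worth of capacity 
per 8 data shards.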

Bottom line: data loss and ruined weekends/holidays are not worth going 
cheap. If I get a hardware alert at night, I want to be able to turn around 
and continue sleeping.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Alexander E. Patrakov <patra...@gmail.com>
Sent: 04 February 2021 11:35:27
To: Mario Giammarco
Cc: ceph-users
Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2

There is a big difference between traditional RAID1 and Ceph. Namely, with
Ceph, there are nodes where OSDs are running, and these nodes need
maintenance. You want to be able to perform maintenance even if you have
one broken OSD, that's why the recommendation is to have three copies with
Ceph. There is no such "maintenance" consideration with traditional RAID1,
so two copies are OK there.

чт, 4 февр. 2021 г. в 00:49, Mario Giammarco <mgiamma...@gmail.com>:

> Thanks Simon and thanks to other people that have replied.
> Sorry but I try to explain myself better.
> It is evident to me that if I have two copies of data, one brokes and while
> ceph creates again a new copy of the data also the disk with the second
> copy brokes you lose the data.
> It is obvious and a bit paranoid because many servers on many customers run
> on raid1 and so you are saying: yeah you have two copies of the data but
> you can broke both. Consider that in ceph recovery is automatic, with raid1
> some one must manually go to the customer and change disks. So ceph is
> already an improvement in this case even with size=2. With size 3 and min 2
> it is a bigger improvement I know.
>
> What I ask is this: what happens with min_size=1 and split brain, network
> down or similar things: do ceph block writes because it has no quorum on
> monitors? Are there some failure scenarios that I have not considered?
> Thanks again!
> Mario
>
>
>
> Il giorno mer 3 feb 2021 alle ore 17:42 Simon Ironside <
> sirons...@caffetine.org> ha scritto:
>
> > On 03/02/2021 09:24, Mario Giammarco wrote:
> > > Hello,
> > > Imagine this situation:
> > > - 3 servers with ceph
> > > - a pool with size 2 min 1
> > >
> > > I know perfectly the size 3 and min 2 is better.
> > > I would like to know what is the worst thing that can happen:
> >
> > Hi Mario,
> >
> > This thread is worth a read, it's an oldie but a goodie:
> >
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
> >
> > Especially this post, which helped me understand the importance of
> > min_size=2
> >
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
> >
> > Cheers,
> > Simon
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>


--
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
