If you have Replica size 3, your failure domain is host, and you have 3
servers... you will NEVER have 2 copies of the data on 1 server.  If you
weight the OSDs on one of your servers poorly, one of its drives will
fill up to the full ratio in its config and stop receiving writes.  You
should always monitor your OSDs so that you can fix the weights before an
OSD becomes nearfull, and definitely so that an OSD never reaches the full
ratio and stops receiving writes.  Note that when an OSD stops receiving
writes, it blocks the write requests until it has space to fulfill them,
and the cluster will appear stuck.
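
To make that behavior concrete, here is a toy sketch (not Ceph code, and
the class/names are made up for illustration) of how an OSD gates writes
on its utilization.  The ratios mirror common defaults (nearfull 0.85,
full 0.95), but check your own cluster's configured values:

```python
# Toy model only: shows writes succeeding, then warning, then blocking
# as utilization crosses the nearfull and full ratios.
NEARFULL_RATIO = 0.85
FULL_RATIO = 0.95

class ToyOSD:
    def __init__(self, capacity_tb):
        self.capacity = capacity_tb
        self.used = 0.0

    @property
    def utilization(self):
        return self.used / self.capacity

    def write(self, size_tb):
        """Refuse writes once the OSD is at/above the full ratio."""
        if self.utilization >= FULL_RATIO:
            return "BLOCKED"  # in practice, client requests hang here
        self.used += size_tb
        if self.utilization >= NEARFULL_RATIO:
            return "OK (nearfull warning)"
        return "OK"

osd = ToyOSD(capacity_tb=4.0)
print(osd.write(3.0))   # OK
print(osd.write(0.5))   # OK (nearfull warning)
```

Once the OSD crosses the full ratio, every further write returns
"BLOCKED" until space is freed.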

Also, to truly answer your question: if you have Replica size 3, your
failure domain is host, and you only have 2 servers in your cluster, you
will only be storing 2 copies of the data and every single PG in your
cluster will be degraded.  Ceph will never breach the boundary of your
failure domain.
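
A toy placement sketch (not real CRUSH, function name invented for
illustration) shows why: if placement may use at most one OSD per host,
then with size=3 and only 2 hosts you can never get a third replica
placed, so every PG stays degraded.

```python
# Toy sketch: never put two replicas in the same failure domain (host).
def place_replicas(hosts, size):
    """Pick at most one OSD per host; returns fewer than `size`
    placements when there aren't enough failure domains."""
    placements = []
    for host, osds in hosts.items():
        if len(placements) == size:
            break
        if osds:  # any OSD on this host will do for the sketch
            placements.append((host, osds[0]))
    return placements

two_hosts = {"host-a": ["osd.0", "osd.1"], "host-b": ["osd.2", "osd.3"]}
replicas = place_replicas(two_hosts, size=3)
print(len(replicas))   # 2 -> every PG degraded; the host boundary holds
```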

When dealing with 3-node clusters you want to be careful never to fill
your cluster past the point where you can still lose a drive in one of
your nodes.  For example, if you have 3 nodes with 3x 4TB drives each and
you lose a drive... the other 2 OSDs in that node need to be able to
absorb the data from the dead drive without going over 80% (a safe margin
below the default nearfull ratio of 85%).  So in this scenario you
shouldn't fill the cluster past about 53%, unless you plan to tell the
cluster not to backfill until the dead OSD is replaced.
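
Here is the back-of-the-envelope math behind the 53% figure, assuming
the dead OSD's data must be re-replicated onto the remaining drives in
the same host (failure domain = host, one replica per host); the
function name is just for illustration:

```python
# Before failure, each OSD holds `fill` of its capacity.  After one of
# n OSDs in a node dies, its data spreads over the other (n - 1), so:
#   fill * n / (n - 1) <= target_ratio
def max_safe_fill(osds_per_node, target_ratio=0.80):
    """Highest cluster fill fraction such that losing one OSD still
    leaves the node's surviving OSDs at or below target_ratio."""
    n = osds_per_node
    return target_ratio * (n - 1) / n

print(round(max_safe_fill(3), 3))   # 0.533 -> about 53%
```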

I would never recommend going into production with fewer failure domains
than your replica size + 2.  So if you have the default Replica size of
3, you should go into production with at least 5 servers.  This gives you
enough failure domains to handle drive failures without the situation
becoming critical.

On Fri, Apr 14, 2017 at 11:25 AM Adam Carheden <carhe...@ucar.edu> wrote:

> Is redundancy across failure domains guaranteed or best effort?
>
> Note: The best answer to the questions below is obviously to avoid the
> situation by properly weighting drives and not approaching the full
> ratio.  I'm just curious how CEPH works.
>
> Hypothetical situation:
> Say you have 1 pool of size=3 and 3 servers, each with 2 OSDs. Say you
> weighted the OSDs poorly such that the OSDs on one server filled up but
> the OSDs on the others still had space. CEPH could still store 3
> replicas of your data, but two of them would be on the same server. What
> happens?
>
> (select all that apply)
> a.[ ] Clients can still read data
> b.[ ] Clients can still write data
> c.[ ] health = HEALTH_WARN
> d.[ ] health = HEALTH_OK
> e.[ ] PGs are degraded
> f.[ ] ceph stores only two copies of data
> g.[ ] ceph stores 3 copies of data, two of which are on the same server
> h.[ ] something else?
>
> If the answer is "best effort" (a+b+d+g), how would you detect if that
> scenario is occurring?
>
> If the answer is "guaranteed" (f+e+c+...) and you lose a drive while in
> that scenario, is there any way to tell CEPH to temporarily store
> 2 copies on a single server just in case? I suspect the answer is to
> remove the host bucket from the crushmap, but that's a really bad idea
> because it would trigger a rebuild and the extra disk activity increases
> the likelihood of additional drive failures, correct?
>
> --
> Adam Carheden
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
