> Consider a cluster of 8 OSD servers with 3 disks on each server.
>
> If I use a profile setting of k=5, m=3 and ruleset-failure-domain=host:
>
> As far as I understand, it can tolerate the failure of 3 OSDs and 1 host.
> Am I right?

When setting up your pool, you specify a crush rule which says what your 
"failure domain" is. You can think of a failure domain as "what's the largest 
single thing that could fail and the cluster would still survive?". By default 
this is a node (a server); large clusters often use a rack instead. Ceph 
places your data across the OSDs in your cluster so that if that large single 
thing (node or rack) fails, your data is still safe and available.
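
For reference, something like the following should set that up (profile, pool 
name and PG counts are just placeholders, and newer Ceph releases spell the 
option crush-failure-domain rather than ruleset-failure-domain):

    # create an erasure-code profile with host as the failure domain
    ceph osd erasure-code-profile set ec53 k=5 m=3 ruleset-failure-domain=host

    # inspect the profile
    ceph osd erasure-code-profile get ec53

    # create an erasure-coded pool that uses it
    ceph osd pool create ecpool 128 128 erasure ec53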

If you specify a single OSD (a disk) as your failure domain, then Ceph might 
end up placing several chunks of the same placement group on different OSDs 
in the same node. This is a bad idea: if that node goes down you lose several 
OSDs at once, and your data might not survive.

If you have 8 nodes and erasure coding of 5+3, then with the default failure 
domain of a node your data will be spread across all 8 nodes (data chunks on 
5 of them, and parity chunks on the other 3). Therefore you could tolerate 3 
whole nodes failing. You are right that 5+3 encoding results in disk usage of 
1.6x the data size.
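
To spell out the arithmetic: each object is split into k = 5 data chunks plus 
m = 3 coding chunks, all of size 1/k of the object, so raw usage is 
(k+m)/k = 8/5 = 1.6x, and you can lose any m = 3 chunks (here, any 3 hosts) 
and still reconstruct everything.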

If you were being pathological about minimising disk usage, I think you could 
in theory set a failure domain of an OSD, then use 8+2 encoding with a crush 
rule that never places more than 2 OSDs from the same node in any one 
placement group. Then technically you could still tolerate a node failure. I 
doubt anyone would recommend that, though!
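
For the curious, a hand-written crush rule along those lines might look 
something like this untested sketch (old-style decompiled syntax, names made 
up): pick 5 hosts, then 2 OSDs under each, giving the 10 chunks that 8+2 
needs while never putting more than 2 of them on one host:

    rule ec-8-2-two-per-host {
            ruleset 1
            type erasure
            min_size 10
            max_size 10
            step take default
            step choose indep 5 type host
            step chooseleaf indep 2 type osd
            step emit
    }

With that layout a single node failure costs at most 2 chunks, which is 
exactly what m=2 covers; any further failure on top of that, though, and 
you're in trouble.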

That said, here's a question for others: say a cluster only has 4 nodes (each 
with many OSDs), would you use 2+2 or 4+4? Either way you use 2x the data in 
disk space and could lose 2 nodes (assuming a proper crush map), but 
presumably 4+4 would be faster and would let you lose more OSDs?
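
(Working the node-failure case through, assuming chunks are spread evenly: 
2+2 puts 1 chunk on each node, so losing 2 nodes costs 2 chunks and m=2 
covers it; 4+4 puts 2 chunks on each node, so losing 2 nodes costs 4 chunks 
and m=4 covers it. For arbitrary OSD failures, 2+2 survives any 2 and 4+4 
survives any 4, which is where the extra resilience would come from.)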

Oliver.
