Hi,

I understand from my reading and research so far that there are a number
of things to consider when deciding how many disks to put into a single
chassis:

1. Higher density means a larger failure domain (more data to re-replicate
if you lose a node)
2. More disks per node means more CPU/memory horsepower to run that many
OSDs
3. The network becomes a bottleneck with too many OSDs per node
4. ...

We are looking at building high-density nodes for small-scale 'starter'
deployments for our customers (maybe 4 or 5 nodes).  High density in this
case could mean a 2U chassis with 2x external 45-disk JBOD enclosures
attached.  That's 90 3TB disks/OSDs to be managed by a single node -
about 243TB of capacity, and so (assuming the cluster is up to 75% full)
maybe 182TB of potential data 'loss' in the event of a node failure.  On
an uncongested, unused 10Gbps network, my back-of-a-beer-mat calculations
say it would take about 45 hours to get the cluster back into an
undegraded state - that is, back to the requisite number of copies of all
objects.
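
For what it's worth, here's the beer-mat maths as a few lines of Python
(the TB-vs-TiB rounding means the intermediate figure doesn't quite match
the 182TB above, but it comes out at roughly the same 45 hours):

    # Rough recovery-time estimate using the assumptions above.
    disks_per_node = 90            # 2 x 45-slot JBODs
    disk_size_bytes = 3e12         # 3TB disks
    fill_ratio = 0.75              # assume the cluster is ~75% full
    link_bytes_per_sec = 10e9 / 8  # 10Gbps, fully available to recovery

    data_to_move = disks_per_node * disk_size_bytes * fill_ratio
    hours = data_to_move / link_bytes_per_sec / 3600.0

    print("~%.0fTB to re-replicate, roughly %.0f hours at line rate"
          % (data_to_move / 1e12, hours))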

Assuming that you can shove in a pair of hex-core hyperthreaded
processors, you're probably OK on number 2.  If you're already
considering 10GbE for the storage network, there's probably not much you
can do about number 3 unless you want to spend a lot more money (and the
whole reason for going so dense is to keep this a cheap option).  So the
main thing would seem to be the fear of 'losing' so much data in the
event of a node failure.  Who wants to wait 45 hours (probably much
longer, assuming the cluster stays live and has production traffic
traversing that network) for the cluster to self-heal?
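
Incidentally, applying the same beer-mat approach to number 2 (the ~1GB
of RAM per 1TB of storage per OSD during recovery is just the rule of
thumb that gets quoted around, so treat it as a rough guide rather than
gospel):

    # Very rough sizing sanity check for number 2.
    osds = 90
    hw_threads = 2 * 6 * 2      # 2 sockets x 6 cores x hyperthreading
    disk_size_tb = 3
    recovery_ram_gb_per_tb = 1  # oft-quoted rule of thumb, not a hard number

    print("OSDs per hardware thread: %.1f" % (osds / float(hw_threads)))
    print("RAM to ride out recovery comfortably: ~%dGB"
          % (osds * disk_size_tb * recovery_ram_gb_per_tb))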

But surely that fear is based on an assumption that in that time you've
not identified and replaced the failed chassis - that you would sit for
2-3 days and just leave the cluster to catch up, rather than actually
address the broken node.  Given good data centre processes and a good
stock of spare parts, isn't it more likely that you'd have replaced that
node and got things back up and running in a matter of hours?  In all
likelihood a node crash/failure won't have taken out all, or maybe any,
of the disks, so can't a new chassis just have the JBODs plugged back in
and away you go?
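
On that note, if the plan is a quick swap rather than letting the cluster
re-replicate everything, my understanding is that setting the noout flag
is the usual way to stop Ceph shuffling data while the node is down.  A
rough sketch (assumes the ceph CLI is available on an admin host):

    # Planned quick chassis swap: stop Ceph marking the down OSDs 'out'
    # (which is what triggers re-replication), do the swap, then revert.
    import subprocess

    subprocess.check_call(["ceph", "osd", "set", "noout"])
    # ... power down the dead chassis, cable the JBODs to the replacement
    # node, and bring the OSDs back up ...
    subprocess.check_call(["ceph", "osd", "unset", "noout"])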

I'm sure I'm missing some other pieces, but if you're comfortable with
your hardware replacement processes, doesn't number 1 become a non-issue,
really?  I understand that in some ways this goes against the concept of
Ceph being self-healing, and that in an ideal world you'd have lots of
lower-density nodes to limit your failure domain, but when cost is the
driver, isn't this an OK way to look at things?  What other glaringly
obvious considerations am I missing with this approach?

Darren
