Hello,

On Tue, 03 Jun 2014 18:52:00 +0200 Cedric Lemarchand wrote:

> Hello,
> 
> On 03/06/2014 12:14, Christian Balzer wrote:
> > A simple way to make 1) and 2) cheaper is to use AMD CPUs; they will
> > do just fine at half the price with these loads.
> > If you're that tight on budget, 64GB RAM will do fine, too.
> I am interested in this specific thought. Could you elaborate on how
> you determined that such hardware (CPU and RAM) will handle cases
> where the cluster goes into rebalancing mode when a node or some OSDs
> go down?
> 
Well, firstly we both read:
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

And looking at those values, a single Opteron 4386 would be more than
sufficient for both 1) and 2).
I'm suggesting a single CPU here to keep everything in one NUMA node.
AFAIK (I haven't used anything Intel for years) some Intel boards require
both CPUs in place to use all available interfaces (PCIe buses), so the
above advice is only for AMD.
As for RAM, 64GB would strictly speaking be overspec'ed, but a huge
pagecache is an immense help for reads and RAM is fairly cheap these days,
so the more you can afford, the better.

Secondly, experience.
The above document is pretty much spot on when it comes to CPU suggestions
for OSDs backed by a single HDD (SSD journal or not).
I think it is overly optimistic when it comes to purely SSD-based storage
nodes or something like my HW RAID backed OSD.
Remember, with the 4k fio run I could get Ceph to use about 2 cores per
OSD and then stall on locking contention or whatever else goes on inside
it, before actually exhausting all available CPU resources.
OSDs (journal and backing storage) as well as the network were nowhere
near getting exhausted.
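(The exact fio invocation from that test isn't reproduced here; purely
for illustration, a 4k random write run of that sort could look roughly
like the following, with filename, size, iodepth etc. being placeholder
values:

  fio --name=4k-randwrite --filename=/mnt/rbd/testfile --size=4g \
      --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
      --iodepth=32 --numjobs=4 --runtime=120 --group_reporting

run against a filesystem on top of an RBD image or similar.)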

Compared to that fio run, a cluster rebalance is a breeze, at least when
it comes to the CPU resources needed.
It comes in a much more Ceph-friendly IO block size and thus exhausts
either network or disk bandwidth first.

> Because, as Robert stated (and I totally agree with that!), designing a
> cluster is about the expected performance in optimal conditions and the
> expected recovery time and node load in non-optimal conditions
> (typically rebalancing), and I found this last point hard to assess
> and anticipate.
> 
This is why you build a test cluster first and then build the production
HW cluster with the expectation that things will be twice as bad as what
you saw on the test cluster. ^o^

> As a quick exercise (without taking into consideration FS size overhead
> etc.), based on config "1.NG" from Christian (SSD/HDD ratio of 1:3,
> thus 9x4TB HDDs/node, 24 nodes) and a replication factor of 2:
I would never use a replication of 2 unless I were VERY confident in my
backing storage devices (either high end and well monitored SSDs or RAIDs).

> 
> - each node: ~36TB RAW / ~18TB NET
> - the whole cluster: 864TB RAW / ~432TB NET
> 
> If a node goes down, ~36TB has to be rebalanced across the 23
> remaining nodes, so ~1.6TB has to be read and written on each node. I
> think this is the expected workload of the cluster in rebalancing mode.
> 
> So 2 questions :
> 
> * is my math good so far?
Math is hard, let's go shopping. ^o^
But yes, given your parameters that looks correct.
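Spelling out the arithmetic with your numbers:

  whole cluster:  24 nodes x 9 x 4TB = 864TB raw, /2 (replication) = ~432TB net
  one node:       9 x 4TB = 36TB raw
  node failure:   ~36TB re-replicated across the 23 survivors
                  = ~1.6TB read and ~1.6TB written per remaining node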
> * where will the main bottleneck be with such a configuration and
> workload (CPU/IO/RAM/NET)? How do you calculate it?
> 
See above.
In the configurations suggested by Benjamin, disk IO will be the
bottleneck, as the network bandwidth is higher than the write capacity of
the SSDs and HDDs. CPU and RAM will not be an issue.

The other thing to consider is the backfill and/or recovery settings in
Ceph; these will of course influence how much of an impact a node failure
(and its eventual recovery) will have.
Depending on those settings and the client-side cluster load at the time
of failure, the most optimistic number for full restoration of redundancy
I can come up with is about an hour; in reality it is probably going to be
substantially longer.
And during that time any further disk failure (with over 200 disks in the
cluster, a pretty decent probability) can result in irrecoverable data
loss.
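The knobs I mean are the usual recovery/backfill throttles in ceph.conf.
A rough sketch (the values here are purely illustrative, not a
recommendation; check the documentation for your release):

  [osd]
      osd max backfills = 1
      osd recovery max active = 1
      osd recovery op priority = 1
      osd client op priority = 63

Lower values keep client IO responsive but stretch out the recovery
window, i.e. the time you are exposed to that second failure; higher
values do the opposite.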

Christian
> 
> --
> Cédric


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
