On 04/06/2014 03:23, Christian Balzer wrote:
> On Tue, 03 Jun 2014 18:52:00 +0200 Cedric Lemarchand wrote:
>> On 03/06/2014 12:14, Christian Balzer wrote:
>>> A simple way to make 1) and 2) cheaper is to use AMD CPUs; they will do
>>> just fine at half the price with these loads.
>>> If you're that tight on budget, 64GB RAM will do fine, too.
>> I am interested in this particular point: could you elaborate on how you
>> determined that such hardware (CPU and RAM) will cope well when the
>> cluster goes into rebalancing mode after a node or some OSDs go down?
> Well, firstly we both read:
> https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
I was not aware of this doc; it answers a lot of the questions I had
about CPU/RAM considerations.

Thanks for your very exhaustive explanations ;-)

Cheers
>
> And looking at those values a single Opteron 4386 would be more
> than sufficient for both 1) and 2). 
> I'm suggesting a single CPU here to keep everything in one NUMA
> node. 
> AFAIK (I haven't used anything Intel for years) some Intel boards require
> both CPUs in place to use all available interfaces (PCIe buses), so the
> above advice is only for AMD.
> As for RAM, even 64GB would be overspec'ed for this workload, but a huge
> pagecache is an immense help for reads and RAM is fairly cheap these days,
> so the more you can afford, the better. 
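
As a quick sanity check on the RAM side, a minimal sketch in Python with my
own numbers plugged in: the often-quoted rule of thumb of roughly 1 GB of RAM
per TB of OSD storage, applied to the 9x 4TB layout from the exercise below:

# Rough RAM sizing sketch -- all numbers are assumptions, not from this thread.
osds_per_node = 9            # 9x 4TB HDD layout from the exercise below
osd_size_tb = 4.0
ram_gb = 64.0

osd_ram_gb = osds_per_node * osd_size_tb * 1.0   # ~1 GB per TB rule of thumb
pagecache_gb = ram_gb - osd_ram_gb - 4.0         # ~4 GB assumed for OS and the rest

print("OSD daemons: ~%.0f GB, pagecache: ~%.0f GB" % (osd_ram_gb, pagecache_gb))
# -> OSD daemons: ~36 GB, pagecache: ~24 GB

So even by that rule of thumb, 64GB leaves a useful chunk for the pagecache.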
>
> Secondly experience.
> The above document is pretty much spot on when it comes to CPU suggestions in
> combination with OSDs backed by a single HDD (SSD journal or not).
> I think it is overly optimistic when it comes to purely SSD-based storage
> nodes or something like my HW RAID backed OSD.
> Remember, with the 4k fio run I could get Ceph to use about 2 cores
> per OSD and then stall on whatever lock contention or other things are
> going on inside it before actually exhausting all available CPU
> resources. 
> OSDs (journal and backing storage) as well as the network were nowhere
> near getting exhausted.
>
> Compared to that fio run, a cluster rebalance is a breeze, at least when
> it comes to the CPU resources needed. 
> It comes in a much more Ceph-friendly IO block size and thus exhausts
> either the network or disk bandwidth first.
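
A back-of-the-envelope way to see this (my sketch, assuming ~4MB objects, the
RBD default object size, and a purely illustrative 1 GB/s of traffic):

# Op rate at a given bandwidth for 4 KB client IO versus 4 MB objects.
bandwidth = 1.0 * 1024**3                 # assume 1 GB/s, purely illustrative

for bs in (4 * 1024, 4 * 1024**2):        # 4 KB client IO vs 4 MB objects
    print("block size %8d B -> ~%9.0f ops/s" % (bs, bandwidth / bs))
# 4 KB -> ~262144 ops/s (the territory where the lock contention showed up)
# 4 MB -> ~256 ops/s    (network or disks saturate long before the CPU)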
>
>> Because, as Robert stated (and I totally agree with him!), designing a
>> cluster is about the expected performance under optimal conditions and the
>> expected recovery time and node load under non-optimal conditions
>> (typically rebalancing), and I find this last point hard to reason about
>> and anticipate.
>>
> This is why one builds test clusters and then builds production HW
> clusters with the expectation that it will be twice as bad as anticipated
> from what you saw on the test cluster. ^o^
>
>> As a quick exercise (without taking into consideration FS size overhead
>> etc.), based on config "1.NG" from Christian (SSD:HDD ratio of 1:3,
>> thus 9x 4TB HDDs per node, 24 nodes) and a replication factor of 2:
> I would never use a replication of 2 unless I were VERY confident in my
> backing storage devices (either high end and well monitored SSDs or RAIDs).
>
>> - each node: ~36TB raw / ~18TB net
>> - the whole cluster: 864TB raw / ~432TB net
>>
>> If a node goes down, ~36TB has to be rebalanced across the 23 remaining
>> nodes, so ~1.6TB has to be read and written on each node. I think this is
>> the expected workload of the cluster in rebalancing mode.
>>
>> So 2 questions :
>>
>> * is my math good so far?
> Math is hard, let's go shopping. ^o^
> But yes, given your parameters that looks correct.
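
For reference, the same exercise in a few lines of Python (same assumptions as
above, ignoring FS overhead etc.):

# 24 nodes, 9x 4TB HDDs per node, replication 2.
nodes, hdds_per_node, hdd_tb, replication = 24, 9, 4.0, 2

raw_per_node = hdds_per_node * hdd_tb               # 36 TB raw per node
net_per_node = raw_per_node / replication           # 18 TB net per node
raw_cluster = nodes * raw_per_node                  # 864 TB raw
net_cluster = raw_cluster / replication             # 432 TB net

# One node down: its data is re-replicated across the 23 survivors.
rebalance_per_node = raw_per_node / (nodes - 1)     # ~1.6 TB per surviving node
print(raw_per_node, net_per_node, raw_cluster, net_cluster, rebalance_per_node)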
>> * where will the main bottleneck be with such a configuration and workload
>> (CPU/IO/RAM/NET)? How do I calculate it?
>>
> See above. 
> In the configurations suggested by Benjamin disk IO will be the
> bottleneck, as the network bandwidth is higher than the write capacity of
> the SSDs and HDDs. CPU and RAM will not be an issue.
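
Illustrative only, since I don't have Benjamin's exact configuration in front
of me; with made-up but plausible per-device numbers (3 journal SSDs, 9 HDDs,
2x 10GbE) the disks hit their ceiling well before the network does:

# Every write goes to the SSD journal first and then to a HDD, so sustained
# write throughput is capped by whichever side is slower.  All numbers assumed.
ssd_journals, ssd_write_mb = 3, 450.0     # assumed journal SSDs per node
hdds, hdd_write_mb = 9, 150.0             # assumed data HDDs per node
network_mb = 2 * 1250.0                   # assumed 2x 10GbE

disk_ceiling_mb = min(ssd_journals * ssd_write_mb, hdds * hdd_write_mb)
print("disk write ceiling: ~%.0f MB/s, network: ~%.0f MB/s"
      % (disk_ceiling_mb, network_mb))
# -> ~1350 MB/s of disk versus ~2500 MB/s of network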
>
> The other thing to consider is the backfill and/or recovery settings
> in Ceph; these will of course influence how much of an impact a node
> failure (and its potential recovery) will have.
> Depending on those settings and the cluster load (as in client side) at
> the time of failure, the most optimistic number for a full recovery of
> redundancy I can come up with is about an hour; in reality it is probably
> going to be substantially longer. 
> And during that time any further disk failure (with over 200 disks in the
> cluster, a pretty decent probability) can result in irrecoverable data loss.
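
A rough way to sanity-check that best case (assumptions are entirely mine, in
particular the sustained backfill rate each node can absorb once client IO,
journal double-writes and the backfill throttles such as osd_max_backfills /
osd_recovery_max_active are factored in):

# ~1.6 TB lands on each surviving node (from the exercise above).
data_per_node_mb = 1.6 * 1024 * 1024      # ~1.6 TB expressed in MB
backfill_mb_per_s = 500.0                 # assumed sustained backfill rate per node

hours = data_per_node_mb / backfill_mb_per_s / 3600
print("best case recovery: ~%.1f h" % hours)   # ~0.9 h with these numbers
# Halve the assumed backfill rate (busy cluster, throttled recovery) and the
# exposure window to a second failure doubles accordingly.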
>
> Christian
>

-- 
Cédric

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
