Hello,

On Tue, 03 Jun 2014 18:52:00 +0200 Cedric Lemarchand wrote:
> Hello,
>
> On 03/06/2014 12:14, Christian Balzer wrote:
> > A simple way to make 1) and 2) cheaper is to use AMD CPUs, they will
> > do just fine at half the price with these loads.
> > If you're that tight on budget, 64GB RAM will do fine, too.
> I am interested in this specific thought; could you elaborate on how
> you determined that such hardware (CPU and RAM) will handle cases
> where the cluster goes into rebalancing mode when a node or some OSDs
> go down?
>
Well, firstly we both read:
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

Looking at those values, a single Opteron 4386 would be more than
sufficient for both 1) and 2). I'm suggesting a single CPU here to keep
everything in one NUMA node. AFAIK (I haven't used anything Intel for
years) some Intel boards require both CPUs in place to use all available
interfaces (PCIe buses), so the above advice applies only to AMD.

As for RAM, 64GB would be totally overspec'ed, but a huge pagecache is an
immense help for reads and RAM is fairly cheap these days, so the more
you can afford, the better.

Secondly, experience. The above document is pretty much spot on when it
comes to CPU suggestions for OSDs backed by a single HDD (SSD journal or
not). I think it is overly optimistic when it comes to purely SSD based
storage nodes or something like my HW RAID backed OSDs.

Remember, with the 4KB fio run I could get Ceph to use about 2 cores per
OSD and then stall on whatever locking contention or other things are
going on inside it before actually exhausting all available CPU
resources. The OSDs (journal and backing storage) as well as the network
were nowhere near exhausted.

Compared to that fio run, a cluster rebalance is a breeze, at least when
it comes to the CPU resources needed. It comes in a much more
Ceph-friendly IO block size and thus exhausts either network or disk
bandwidth first.

> Because, as Robert stated (and I totally agree with that!), designing a
> cluster is about the expected performance in optimal conditions, and
> the expected recovery time and node load in non-optimal conditions
> (typically rebalancing), and I found this last point hard to consider
> and anticipate.
>
This is why one builds a test cluster and then specs the production HW
with the expectation that it will be twice as bad as what you saw on the
test cluster. ^o^

> As a quick exercise (without taking into consideration FS size overhead
> etc.), based on config "1.NG" from Christian (SSD/HDD ratio of 1:3,
> thus 9x4TB HDDs per node, 24 nodes) and a replication factor of 2:

I would never use a replication factor of 2 unless I were VERY confident
in my backing storage devices (either high end and well monitored SSDs
or RAIDs).

> - each node: ~36TB RAW / ~18TB NET
> - the whole cluster: 864TB RAW / ~432TB NET
>
> If a node goes down, ~36TB have to be rebalanced between the 23
> remaining nodes, so ~1.6TB have to be read and written on each node. I
> think this is the expected workload of the cluster in rebalancing mode.
>
> So 2 questions:
>
> * are my maths good until now?

Math is hard, let's go shopping. ^o^
But yes, given your parameters that looks correct (a quick sketch of the
arithmetic follows below).

> * where will the main bottleneck be with such a configuration and
> workload (CPU/IO/RAM/NET)? how to calculate it?

See above. In the configurations suggested by Benjamin, disk IO will be
the bottleneck, as the network bandwidth is higher than the write
capacity of the SSDs and HDDs. CPU and RAM will not be an issue.
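To make the arithmetic above explicit, here is a quick back-of-the-envelope
sketch in Python; the only inputs are Cédric's own parameters (24 nodes,
9x4TB HDDs per node, replication 2), nothing else is assumed:

    # Back-of-the-envelope rebalance math for the "1.NG" configuration:
    # 24 nodes, 9 x 4TB HDDs per node, replication factor 2.
    nodes = 24
    hdds_per_node = 9
    hdd_tb = 4
    replication = 2

    raw_per_node = hdds_per_node * hdd_tb        # ~36 TB raw per node
    net_per_node = raw_per_node // replication   # ~18 TB net per node
    raw_cluster = raw_per_node * nodes           # 864 TB raw
    net_cluster = raw_cluster // replication     # ~432 TB net

    # If one node dies, its ~36 TB of replicas has to be recreated on the
    # 23 surviving nodes, i.e. read and written once each.
    rebalance_per_node = raw_per_node / (nodes - 1)   # ~1.6 TB per node

    print(f"raw/node: {raw_per_node} TB, net/node: {net_per_node} TB")
    print(f"raw/cluster: {raw_cluster} TB, net/cluster: {net_cluster} TB")
    print(f"to re-replicate per surviving node: {rebalance_per_node:.2f} TB")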
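And to get a rough feel for how that ~1.6TB per surviving node translates
into recovery time when disk IO is the limit, another sketch; the sustained
backfill write rate per node is purely an assumption here, plug in whatever
your HDDs/SSDs can actually deliver alongside client traffic:

    # Very rough recovery-time estimate, assuming disk IO is the bottleneck.
    # The per-node backfill write rate is a guess; in reality it depends on
    # the disks, the backfill/recovery settings and concurrent client load.
    rebalance_per_node_tb = 36 / 23      # ~1.6 TB to re-replicate per node
    backfill_mb_per_s = 500              # ASSUMED sustained backfill writes

    seconds = rebalance_per_node_tb * 1e6 / backfill_mb_per_s   # TB -> MB
    print(f"~{seconds / 3600:.1f} hours to restore redundancy")  # ~0.9 hours

Halve that assumed rate and you are already at roughly two hours, which is
why the "about an hour" below is the optimistic end of the range.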
The other thing to consider is the backfill and/or recovery settings in
Ceph; these will of course influence how much of an impact a node failure
(and its potential recovery) will have.

Depending on those settings and the cluster load (client side, that is) at
the time of failure, the most optimistic number for full recovery of
redundancy I can come up with is about an hour; in reality it is probably
going to be substantially longer. And during that time any further disk
failure (with over 200 disks in the cluster, a pretty decent probability)
can result in irrecoverable data loss.

Christian

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/