On 04/06/2014 03:23, Christian Balzer wrote:
> On Tue, 03 Jun 2014 18:52:00 +0200 Cedric Lemarchand wrote:
>> On 03/06/2014 12:14, Christian Balzer wrote:
>>> A simple way to make 1) and 2) cheaper is to use AMD CPUs, they will do
>>> just fine at half the price with these loads.
>>> If you're that tight on budget, 64GB RAM will do fine, too.
>> I am interested in this specific point: could you elaborate on how you
>> determined that such hardware (CPU and RAM) will cope well when the
>> cluster goes into rebalancing mode because a node or some OSDs go down?
> Well, firstly we both read:
> https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

I was not aware of this document; it answers a lot of the questions I had
about CPU/RAM sizing.
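
Just to check that I am reading the sizing logic correctly, here is a rough
back-of-the-envelope sketch of how I would apply it to one of the 9-OSD
nodes. The per-OSD figures below (about 1 GHz of CPU and 1-2 GB of RAM per
HDD-backed OSD daemon, plus some base headroom) are the commonly quoted
rules of thumb I am assuming, not numbers taken from the guide itself:

# Rough per-node sizing sanity check for a 9 x 4TB HDD OSD node.
# The per-OSD constants are assumptions (rules of thumb), not
# figures taken from the Inktank guide.

OSDS_PER_NODE = 9
GHZ_PER_OSD = 1.0        # assumed: ~1 GHz of CPU per HDD-backed OSD
RAM_GB_PER_OSD = 2.0     # assumed: ~1-2 GB RAM per OSD daemon
BASE_RAM_GB = 8          # assumed: OS and other daemons headroom

cpu_ghz_needed = OSDS_PER_NODE * GHZ_PER_OSD
ram_gb_needed = BASE_RAM_GB + OSDS_PER_NODE * RAM_GB_PER_OSD

print(f"CPU needed: ~{cpu_ghz_needed:.0f} GHz total "
      f"(an 8-core 3.1 GHz Opteron 4386 offers ~{8 * 3.1:.0f} GHz)")
print(f"RAM needed by the daemons: ~{ram_gb_needed:.0f} GB, "
      f"the rest of 64 GB would go to pagecache")

If those assumptions are in the right ballpark, a single Opteron with 64GB
does indeed leave plenty of headroom for the daemons and the pagecache.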
Thanks for your very exhaustive explanations ;-)

Cheers

>
> And looking at those values, a single Opteron 4386 would be more than
> sufficient for both 1) and 2).
> I'm suggesting a single CPU here to keep everything in one NUMA node.
> AFAIK (I haven't used anything Intel for years) some Intel boards require
> both CPUs in place to use all available interfaces (PCIe buses), so the
> above advice only applies to AMD.
> As for RAM, 64GB would be totally overspec'ed, but a huge pagecache is an
> immense help for reads and RAM is fairly cheap these days, so the more
> you can afford, the better.
>
> Secondly, experience.
> The above document is pretty much spot on when it comes to CPU
> suggestions for OSDs backed by a single HDD (SSD journal or not).
> I think it is overly optimistic when it comes to purely SSD based storage
> nodes or something like my HW RAID backed OSD.
> Remember, with the 4k fio run I could get Ceph to use about 2 cores per
> OSD and then stall on whatever lock contention or other things are going
> on inside it, before actually exhausting all available CPU resources.
> The OSDs (journal and backing storage) as well as the network were
> nowhere near exhausted.
>
> Compared to that fio run, a cluster rebalance is a breeze, at least when
> it comes to the CPU resources needed.
> It comes in a much more Ceph-friendly IO block size and thus exhausts
> either network or disk bandwidth first.
>
>> Because, as Robert stated (and I totally agree with that!), designing a
>> cluster is about the expected performance in optimal conditions, and
>> the expected recovery time and node load in non-optimal conditions
>> (typically rebalancing), and I find this last point hard to reason
>> about and anticipate.
>>
> This is why one builds test clusters and then builds production HW
> clusters with the expectation that they will be twice as bad as
> anticipated from what you saw on the test cluster. ^o^
>
>> As a quick exercise (without taking into account FS overhead etc.),
>> based on config "1.NG" from Christian (SSD/HDD ratio of 1:3, thus 9x4TB
>> HDDs per node, 24 nodes) and a replication factor of 2:
> I would never use a replication factor of 2 unless I were VERY confident
> in my backing storage devices (either high end, well monitored SSDs or
> RAIDs).
>
>> - each node: ~36TB raw / ~18TB net
>> - the whole cluster: ~864TB raw / ~432TB net
>>
>> If a node goes down, ~36TB has to be rebalanced across the 23 remaining
>> nodes, so ~1.6TB has to be read and written on each node. I think this
>> is the expected workload of the cluster in rebalancing mode.
>>
>> So, 2 questions:
>>
>> * Is my math correct so far?
> Math is hard, let's go shopping. ^o^
> But yes, given your parameters that looks correct.
>> * Where will the main bottleneck be with such a configuration and
>>   workload (CPU/IO/RAM/NET)? How do I calculate it?
>>
> See above.
> In the configurations suggested by Benjamin, disk IO will be the
> bottleneck, as the network bandwidth is higher than the write capacity of
> the SSDs and HDDs. CPU and RAM will not be an issue.
>
> The other thing to consider is the backfill and/or recovery settings in
> Ceph; these will of course influence how much of an impact a node failure
> (and its potential recovery) has.
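
Agreed. For my own notes, here is the quick exercise from above as a small
script, extended with a naive lower bound on the recovery time. The
per-node backfill throughput (500 MB/s) is purely an assumption for
illustration, not a measured value:

# Back-of-the-envelope rebalance math for the "1.NG"-style layout
# discussed above (24 nodes, 9 x 4TB HDDs per node, replication 2).

NODES = 24
HDDS_PER_NODE = 9
HDD_TB = 4
REPLICATION = 2

raw_per_node_tb = HDDS_PER_NODE * HDD_TB            # ~36 TB
raw_cluster_tb = NODES * raw_per_node_tb            # ~864 TB
net_cluster_tb = raw_cluster_tb / REPLICATION       # ~432 TB

# If one full node fails, its ~36 TB of PG copies must be recreated on
# the 23 surviving nodes (read from the remaining replicas and written
# to the new locations).
surviving = NODES - 1
per_node_rebalance_tb = raw_per_node_tb / surviving  # ~1.6 TB

# Naive best-case recovery time, assuming each node can sustain e.g.
# 500 MB/s of backfill writes on top of client IO (assumption).
BACKFILL_MBPS = 500
hours = per_node_rebalance_tb * 1e6 / BACKFILL_MBPS / 3600

print(f"raw per node:   {raw_per_node_tb} TB")
print(f"raw cluster:    {raw_cluster_tb} TB, net {net_cluster_tb:.0f} TB")
print(f"rebalance/node: {per_node_rebalance_tb:.2f} TB")
print(f"naive recovery lower bound: ~{hours:.1f} h")

With those (optimistic) assumptions the lower bound comes out just under
an hour per failed node, which at least does not contradict the estimate
below.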
> Depending on those settings and the cluster load (client side) at the
> time of failure, the most optimistic number for full recovery of
> redundancy I can come up with is about an hour; in reality it is probably
> going to be substantially longer.
> And during that time any further disk failure (with over 200 disks in the
> cluster, a pretty decent probability) can result in irrecoverable data
> loss.
>
> Christian

--
Cédric
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com