Hi Gregory,
Thanks for your replies. Let's take the 2-host setup (3 MONs + 3 idle MDS on the same hosts):

- 2 Dell R510 servers, CentOS 7.0.1406, dual Xeon 5620 (8 cores + hyperthreading), 16GB RAM
- 1 or 2 x 10Gbit/s Ethernet (same results with and without a private 10Gbit/s network)
- PERC H700 + 12 x 2TB SAS disks, and PERC H800 + 11 x 2TB SAS disks (plus one unused SSD…)
- The EC pool is defined with k=4, m=1; I set the failure domain to OSD for the test
- The OSDs are set up with XFS and a 10GB journal on the first partition of each disk (the single Dell SSD would have been a bottleneck for 23 disks…)
- All disks are presently configured as single-disk RAID0 volumes because the H700/H800 do not support JBOD

I have 5 clients (CentOS 7.1, 10Gbit/s Ethernet), all running this command:

rados -k ceph.client.admin.keyring -p testec bench 120 write -b 4194304 -t 32 --run-name "bench_`hostname -s`" --no-cleanup

I'm aggregating the average bandwidth at the end of the tests, and monitoring the Ceph servers live with this dstat command:

dstat -N p2p1,p2p2,total

The network MTU is 9000 on all nodes.

With this, the average client throughput is around 130 MiB/s, i.e. about 650 MiB/s for the whole 2-node Ceph cluster across the 5 clients.

I have since tried removing (ceph osd out / ceph osd crush reweight 0) either the H700 or the H800 disks, thus only using 11 or 12 disks per server, and I get either 550 MiB/s or 590 MiB/s of aggregated client bandwidth. Not much less, considering I removed half the disks!

I'm therefore starting to think I am CPU or memory-bandwidth limited…? That's not, however, what I'm tempted to conclude (for the CPU at least) when I look at the dstat output, since it shows the CPUs still sitting mostly idle or waiting on I/O:

----total-cpu-usage---- -dsk/total- --net/p2p1----net/p2p2---net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send: recv  send: recv  send|  in   out | int   csw
  1   1  97   0   0   0| 586k 1870k|   0     0 :   0     0 :   0     0 |  49B  455B|8167   15k
 29  17  24  27   0   3| 128k  734M| 367M  870k:   0     0 : 367M  870k|   0     0 | 61k   61k
 30  17  34  16   0   3| 432k  750M| 229M  567k: 199M  168M: 427M  168M|   0     0 | 65k   68k
 25  14  38  20   0   3|  16k  634M| 232M  654k: 162M  133M: 393M  134M|   0     0 | 56k   64k
 19  10  46  23   0   2| 232k  463M| 244M  670k: 184M  138M: 428M  139M|   0     0 | 45k   55k
 15   8  46  29   0   1| 368k  422M| 213M  623k: 149M  110M: 362M  111M|   0     0 | 35k   41k
 25  17  37  19   0   3|  48k  584M| 139M  394k: 137M   90M: 276M   91M|   0     0 | 54k   53k

Could it be interrupts or system context switches that cause this relatively poor per-node performance? PCIe interactions with the PERC cards? I know I can get far more disk throughput with dd (command below):

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   1  97   0   0   0| 595k 2059k|   0     0 | 634B 2886B|7971   15k
  1  93   0   3   0   3|   0  1722M|  49k   78k|   0     0 | 40k   47k
  1  93   0   3   0   3|   0  1836M|  40k   69k|   0     0 | 45k   57k
  1  95   0   2   0   2|   0  1805M|  40k   69k|   0     0 | 38k   34k
  1  94   0   3   0   2|   0  1864M|  37k   38k|   0     0 | 35k   24k
(…)

dd command:

# use at your own risk #
FS_THR=64 ; FILE_MB=8 ; N_FS=`mount | grep ceph | wc -l`
time (
  for i in `mount | grep ceph | awk '{print $3}'` ; do
    echo "writing $FS_THR threads of $FILE_MB MB on $i..."
    for j in `seq 1 $FS_THR` ; do
      # each thread writes FILE_MB MB of zeroes, fsync'ed at the end
      dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M count=$[ FILE_MB / 4 ] &
    done
  done
  wait
)
echo "wrote $[ N_FS * FILE_MB * FS_THR ] MB on $N_FS FS with $FS_THR threads"
rm -f /var/lib/ceph/osd/*/test.zero*

Hope this gives you more insight into what I'm trying to achieve, and where I'm failing.
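In case it helps reproduce the test exactly, the pool was created with something along these lines (a sketch only — the profile name and PG numbers below are illustrative, not necessarily what I used; on 0.94 the failure-domain key is ruleset-failure-domain):

ceph osd erasure-code-profile set ec41 k=4 m=1 ruleset-failure-domain=osd
ceph osd pool create testec 2048 2048 erasure ec41

And since I suspect interrupt/softirq distribution, here is what I plan to look at next on the OSD nodes — standard tools only, nothing Ceph-specific, and the driver names (bnx2x for the Broadcom NICs, megasas for the PERCs) may differ on your hardware:

mpstat -P ALL 1                    # per-core %irq / %soft / %iowait, hidden by the aggregated dstat view
watch -d 'cat /proc/interrupts'    # per-CPU interrupt counts for the NIC and RAID controller queues
grep . /proc/irq/*/smp_affinity    # current IRQ-to-CPU affinity masks (is irqbalance doing its job?)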
Regards

-----Original Message-----
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Wednesday, 22 July 2015 16:01
To: Florent MONTHEL
Cc: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

We might also be able to help you improve or better understand your results if you can tell us exactly what tests you're conducting that are giving you these numbers.
-Greg

On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL <fmont...@flox-arts.net> wrote:
> Hi Frederic,
>
> When you have a Ceph cluster with 1 node, you don't experience the network and
> communication overhead of the distributed model.
> With 2 nodes and EC 4+1 you will have communication between the 2 nodes, but you
> will keep some internal communication (2 chunks on the first node and 3 chunks on
> the second node).
> With your configuration, the EC pool is set up with 4+1, so each write carries the
> overhead of being spread across 5 nodes (for 1 client IO, you will experience
> 5 Ceph IOs due to EC 4+1).
> That's the reason why I think you're reaching performance stability with
> 5 nodes and more in your cluster.
>
>
> On Jul 20, 2015, at 10:35 AM, SCHAER Frederic <frederic.sch...@cea.fr> wrote:
>
> Hi,
>
> As I explained in various previous threads, I'm having a hard time getting
> the most out of my test Ceph cluster.
> I'm benching things with rados bench.
> All Ceph hosts are on the same 10GB switch.
>
> Basically, I know I can get about 1GB/s of disk write performance per host
> when I bench things with dd (hundreds of dd threads) + iperf 10Gbit
> inbound + iperf 10Gbit outbound.
> I can also get 2GB/s or even more if I don't bench the network at the same
> time, so yes, there is a bottleneck between disks and network, but I can't
> identify which one, and it's not relevant for what follows anyway
> (Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about
> this strange bottleneck though…).
>
> My hosts are each connected through a single 10Gbit/s link for now.
>
> My problem is the following. Please note I see the same kind of poor
> performance with replicated pools...
> When testing EC pools, I ended up putting a 4+1 pool on a single node in order
> to track down the Ceph bottleneck.
> On that node, I can get approximately 420MB/s write performance using rados
> bench, which is fair enough since the dstat output shows that the real data
> throughput on the disks is about 800+MB/s (that's the Ceph journal effect, I
> presume).
>
> I tested Ceph on my other standalone nodes: I also get around 420MB/s,
> since they're identical.
> I'm testing things with 5 10Gbit/s clients, each running rados bench.
>
> But what I really don't get is the following:
>
> - With 1 host: throughput is 420MB/s
> - With 2 hosts: I get 640MB/s. That's surely not 2x420MB/s.
> - With 5 hosts: I get around 1375MB/s. That's far from the expected 2GB/s.
>
> The network is never maxed out, nor are the disks or CPUs.
> The per-host throughput I see with rados bench seems to match the dstat
> throughput.
> It's as if each additional host were only capable of adding 220MB/s of
> throughput. Compare this to the 1GB/s they are capable of (420MB/s with
> journals)…
>
> I'm therefore wondering what could possibly be so wrong with my setup??
> Why would adding hosts impact the performance so much?
>
> On the hardware side, I have Broadcom BCM57711 10-Gigabit PCIe cards.
> I know, not perfect, but not THAT bad either… ?
>
> Any hint would be greatly appreciated!
>
> Thanks
> Frederic Schaer
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com