Hi Cephers, I've set up a production Ceph cluster on the Jewel release (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)) consisting of 3 MON servers and 6 OSD servers:
3x MON servers:
  2x Intel Xeon E5-2630v3 @ 2.40GHz
  384GB RAM
  2x 200G Intel DC3700 in RAID-1 for OS
  1x InfiniBand ConnectX-3 ADPT DP

6x OSD servers:
  2x Intel Xeon E5-2650v2 @ 2.60GHz
  128GB RAM
  2x 200G Intel DC3700 in RAID-1 for OS
  12x 800G Intel DC3510 (OSD & journal on the same device)
  1x InfiniBand ConnectX-3 ADPT DP (one port on the PUB network, the other on the CLUS network)

The ceph.conf file is:

[global]
fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxx
mon_initial_members = cibm01, cibm02, cibm03
mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = xx.xx.16.0/20
cluster_network = xx.xx.32.0/20

[mon]

[mon.cibm01]
host = cibm01
mon_addr = xx.xx.xx.1:6789

[mon.cibm02]
host = cibm02
mon_addr = xx.xx.xx.2:6789

[mon.cibm03]
host = cibm03
mon_addr = xx.xx.xx.3:6789

[osd]
osd_pool_default_size = 2
osd_pool_default_min_size = 1

## OSD Configuration ##
[osd.0]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1

[osd.1]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1

...

They are all running *Ubuntu 14.04.4 LTS*. Journals are 5GB partitions on each disk, since all the OSDs are SSDs (Intel DC3510 800G). For example:

sdc      8:32   0 745.2G  0 disk
|-sdc1   8:33   0 740.2G  0 part /var/lib/ceph/osd/ceph-0
`-sdc2   8:34   0     5G  0 part

The purpose of this cluster is to serve as backend storage for Cinder volumes (RBD) and Glance images in an OpenStack cloud; most of the clusters on OpenStack will be non-relational databases like Cassandra, with many instances each.

All of the nodes are on InfiniBand FDR 56Gb/s with Mellanox Technologies MT27500 Family [ConnectX-3] adapters. So I assumed performance would be really nice, right? ...but I'm getting numbers that I think could be quite a bit better.
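For context, here's a rough ceiling for aggregate client write bandwidth on this layout (a sketch only: the ~450 MB/s per-SSD sequential write figure is a datasheet-style assumption, not measured on these disks; the rest comes from the setup above):

```python
# Back-of-envelope write ceiling for the cluster described above.
# ASSUMPTION: ~450 MB/s sequential write per 800G SSD (ballpark, not measured).
ssds = 6 * 12            # 6 OSD servers x 12 OSDs each
per_ssd_write = 450      # MB/s, assumed per-SSD sequential write
replication = 2          # osd_pool_default_size = 2
journal_penalty = 2      # filestore journal colocated on the same device:
                         # every write hits the SSD twice (journal + data)

ceiling = ssds * per_ssd_write / (replication * journal_penalty)
print(f"aggregate client write ceiling ~ {ceiling:.0f} MB/s")
```

Replication and the colocated journal each halve effective write bandwidth, so a single client will see far less than the raw sum of the SSDs.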
# rados --pool rbd bench 10 write -t 16

Total writes made:      1964
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     *755.435*
Stddev Bandwidth:       90.3288
Max bandwidth (MB/sec): 884
Min bandwidth (MB/sec): 612
Average IOPS:           188
Stddev IOPS:            22
Max IOPS:               221
Min IOPS:               153
Average Latency(s):     0.0836802
Stddev Latency(s):      0.147561
Max latency(s):         1.50925
Min latency(s):         0.0192736

Then I connect to another server (this one is on QDR, so I would expect something between 2-3Gb/s), map an RBD on the host, create an ext4 filesystem on it, mount it, and finally run a fio test:

# fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22 --time_based --size=10G --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1

fio-2.1.3
Starting 8 processes
cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
Jobs: 7 (f=7): [wwwwww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0 iops] [eta 00m:00s]
cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr  7 15:24:12 2016
  write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
    slat (msec): min=1, max=480, avg=46.15, stdev=63.68
    clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
     lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
    clat percentiles (msec):
     |  1.00th=[  235],  5.00th=[  478], 10.00th=[  611], 20.00th=[  766],
     | 30.00th=[  889], 40.00th=[  988], 50.00th=[ 1106], 60.00th=[ 1237],
     | 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
     | 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
     | 99.99th=[ 8979]
    bw (KB /s): min= 3091, max=209877, per=12.31%, avg=83280.51, stdev=35226.98
    lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
    lat (msec) : 2000=45.04%, >=2000=13.69%
  cpu          : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
  IO depths    : 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%,
16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=3821/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s, maxb=676411KB/s, mint=23138msec, maxt=23138msec

Disk stats (read/write):
  rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996, util=99.08%

Does this look acceptable? For an InfiniBand network, I would expect the throughput to be better. How much more can I expect to gain by tuning the servers?

The MTU on the OSD servers is 65520, no dropped packets are reported, and txqueuelen is 256.

In the openib.conf file I've set:

...
SET_IPOIB_CM=yes
IPOIB_MTU=65520
...

And in the mlnx.conf file:

...
options mlx4_core enable_sys_tune=1
options mlx4_core log_num_mgm_entry_size=-7
...

Can anyone here with experience on InfiniBand setups give me any hints to improve performance? I'm getting similar numbers with another cluster on a 10GbE network :S

Thanks,
*German*
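PS: to sanity-check whether the fabric itself is the limit, it helps to put the measured numbers next to one FDR link's data rate (a sketch: 64/66b encoding is the standard FDR figure; IPoIB/TCP overheads are ignored, so the real usable rate is somewhat lower):

```python
# Compare the measured bandwidths against one FDR link's data rate.
fdr_signal_gbps = 56
data_rate_gbps = fdr_signal_gbps * 64 / 66   # FDR uses 64/66b encoding
link_mb_s = data_rate_gbps * 1000 / 8        # ~6788 MB/s before IPoIB overhead

rados_mb_s = 755.4                           # rados bench result above
replication = 2                              # osd_pool_default_size = 2
cluster_net_mb_s = rados_mb_s * replication  # traffic replication puts on the cluster network

print(f"one FDR link ~ {link_mb_s:.0f} MB/s usable (upper bound)")
print(f"rados bench drives ~ {cluster_net_mb_s / link_mb_s:.0%} of that")
```

Since the bench only touches a fraction of the link, the long fio completion latencies (avg ~1.5 s clat) suggest the time is accumulating on the OSD write path (colocated filestore journals, deep queues) rather than on the IPoIB network.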
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com