Hi everyone,
I am currently testing our new cluster and I would like some feedback on the numbers I am getting.

Hardware:

7 x OSD: 2 x Intel 2640v3 (8x2.6GHz), 64GB RAM, 2x10Gbits LACP for public net., 2x10Gbits LACP for cluster net., MTU 9000
1 x MON: 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, 2x10Gbits LACP for public net., MTU 9000
2 x MON: VMs (8 cores, 8GB RAM), backed by SSD

Journals are 20GB partitions on SSD.

The system is CentOS 7.1 with the stock kernel (3.10.0-229.20.1.el7.x86_64). No particular system optimizations.

Ceph is Infernalis from the Ceph repository: ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

[cephadm@cph-adm-01 ~/scripts]$ ceph -s
    cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
     health HEALTH_OK
     monmap e1: 3 mons at {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
            election epoch 62, quorum 0,1,2 clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
     osdmap e844: 84 osds: 84 up, 84 in
            flags sortbitwise
      pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 kobjects
            8308 GB used, 297 TB / 305 TB avail
                3136 active+clean

My ceph.conf:

[global]
fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
mon_initial_members = clb-cph-frpar2-mon-01, clb-cph-frpar1-mon-02, clb-cph-frpar2-mon-03
mon_host = x.x.x.1,x.x.x.2,x.x.x.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 10.25.25.0/24
cluster network = 10.25.26.0/24
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[osd]
osd journal size = 0
osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k"
filestore min sync interval = 5
filestore max sync interval = 15
filestore queue max ops = 2048
filestore queue max bytes = 1048576000
filestore queue committing max ops = 4096
filestore queue committing max bytes = 1048576000
filestore op thread = 32
filestore journal writeahead = true
filestore merge threshold = 40
filestore split multiple = 8
journal max write bytes = 1048576000
journal max write entries = 4096
journal queue max ops = 8092
journal queue max bytes = 1048576000
osd max write size = 512
osd op threads = 16
osd disk threads = 2
osd op num threads per shard = 3
osd op num shards = 10
osd map cache size = 1024
osd max backfills = 1
osd recovery max active = 2

I have set up 2 pools: a cache pool with 3x replication in front of an EC pool. At the moment I am only interested in the cache pool, so no promotions/flushes/evictions happen. (I know, I am using the same set of OSDs for hot and cold data, but in my use case they should not be used at the same time.)

I am accessing the cluster via RBD volumes mapped with the kernel module on CentOS 7.1. These volumes are formatted with XFS on the clients.
The journal SSDs (they are Sandisk) seem to perform quite well according to the results of Sebastien Han's suggested benchmark:

write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec

(this is for numjobs=10)

Here are the rados bench tests:

rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup

Total time run:         121.410763
Total writes made:      65357
Write size:             4096
Bandwidth (MB/sec):     2.1
Stddev Bandwidth:       0.597
Max bandwidth (MB/sec): 3.89
Min bandwidth (MB/sec): 0.00781
Average IOPS:           538
Stddev IOPS:            152
Max IOPS:               995
Min IOPS:               2
Average Latency:        0.0594
Stddev Latency:         0.18
Max latency:            2.82
Min latency:            0.00494

And the results of the fio test with the following parameters:

[global]
size=8G
runtime=300
ioengine=libaio
invalidate=1
direct=1
sync=1
fsync=1
numjobs=32
rw=randwrite
name=4k-32-1-randwrite-libaio
blocksize=4K
iodepth=1
directory=/mnt/rbd
group_reporting=1

4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=20442: Wed Dec 2 21:38:30 2015
  write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
    slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
    clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
     lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38], 95.00th=[   63],
     | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074], 99.95th=[ 1221],
     | 99.99th=[ 1532]
    bw (KB /s): min= 1, max= 448, per=3.64%, avg=123.48, stdev=102.09
    lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, 100=4.03%
    lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
  cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0, minf=928
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, maxb=3389KB/s, mint=300011msec, maxt=300011msec

Disk stats (read/write):
  rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847, in_queue=5677825, util=100.00%

And a job closer to what the actual workload would be (blocksize=200K, numjobs=16, QD=32):

200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0: pid=4828: Wed Dec 2 18:58:53 2015
  write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
    slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
    clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
     lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
    clat percentiles (msec):
     |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502], 20.00th=[  537],
     | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594], 60.00th=[  603],
     | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701], 95.00th=[  881],
     | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], 99.95th=[ 2671],
     | 99.99th=[ 2999]
    bw (KB /s): min= 260, max=18181, per=6.31%, avg=10189.40, stdev=2009.86
    lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=0.08%
    lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
  cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0, minf=433
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s, maxb=161367KB/s, mint=300189msec, maxt=300189msec

Disk stats (read/write):
  rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593, in_queue=5887504, util=99.91%

The 4k block performance does not interest me so much but is given as a reference.
I am mainly interested in throughput, but even so, the numbers seem quite low.

Take IOPS: assuming each spinner can do 50 sustained synced 4k IOPS (I hope they can do more ^^), we should be around 50 x 84 / 3 = 1400 IOPS, which is well above what rados bench (538) and fio (847) report. Surprisingly, the fio numbers are also higher than the rados bench ones.

So I don't know whether I am missing something here or whether something is going wrong (maybe both!).

Any input would be very valuable.

Thank you,
Adrien
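
P.S. For reference, the back-of-the-envelope estimate above can be written out as a small Python sketch. The function name is made up for illustration, and the model is deliberately naive: it assumes every client write costs exactly one synced write on each of the 3 replicas and ignores journal, network and cache-tier overhead. The 50 IOPS per spinner figure is a guess, not a measurement.

```python
def estimate_client_write_iops(per_disk_iops, num_osds, replication):
    """Naive ceiling on client write IOPS for a replicated pool.

    Assumes each client write turns into one write on each of
    `replication` OSDs, spread evenly over all disks.
    """
    return per_disk_iops * num_osds / replication


# Assumption from the post: 50 synced 4k IOPS per spinner (a guess).
expected = estimate_client_write_iops(per_disk_iops=50, num_osds=84, replication=3)

# Average IOPS reported by the two 4k tests above.
measured = {"rados bench": 538, "fio 4k": 847}

print(f"expected: {expected:.0f} IOPS")
for name, iops in measured.items():
    # Show how far each measurement is from the naive estimate.
    print(f"{name}: {iops} IOPS ({iops / expected:.0%} of the estimate)")
```

Changing per_disk_iops is an easy way to see how sensitive the estimate is to the guess about the spinners.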
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com