A couple of things to check:

1.      Can you create a plain, non-cached replicated pool and test performance against it, to 
rule out anything odd going on in the cache tier? (A sketch of the commands is below.)
2.      Can you also run something like iostat on the OSD nodes during the benchmarks and see 
whether all your disks are getting saturated?
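
A minimal sketch for both; the pool name and PG count below are placeholders you would pick yourself:

        ceph osd pool create perftest 512 512 replicated
        rados bench -p perftest 60 write -t 32 --no-cleanup
        rados bench -p perftest 60 seq -t 32

        # on each OSD node, while the benchmark is running
        iostat -xm 5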
        

                _____________________________________________
                From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
Behalf Of Adrien Gillard
                Sent: 02 December 2015 21:33
                To: ceph-us...@ceph.com
                Subject: [ceph-users] New cluster performance analysis
                

                Hi everyone, 
                 
                I am currently testing our new cluster and I would like some 
feedback on the numbers I am getting.
                 
                For the hardware : 
                7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64GB RAM, 2x10Gbits LACP 
for public net., 2x10Gbits LACP for cluster net., MTU 9000
                1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, 
2x10Gbits LACP for public net., MTU 9000
                2 x MON : VMs (8 cores, 8GB RAM), backed by SSD
                 
                Journals are 20GB partitions on SSD
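                (For reference, one common way to lay OSDs out like this at the 
time is ceph-disk, pointing each data disk at a pre-made journal partition. 
Whether that is how these were deployed is an assumption, and the device paths 
below are placeholders:)

                ceph-disk prepare --cluster ceph /dev/sdb /dev/sdk1
                ceph-disk activate /dev/sdb1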
                 
                The system is CentOS 7.1 with stock kernel 
(3.10.0-229.20.1.el7.x86_64). No particular system optimizations.
                 
                Ceph is Infernalis from the Ceph repository : ceph version 9.2.0 
(bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
                 
                [cephadm@cph-adm-01  ~/scripts]$ ceph -s
                    cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
                     health HEALTH_OK
                     monmap e1: 3 mons at 
{clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
                            election epoch 62, quorum 0,1,2 
clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
                     osdmap e844: 84 osds: 84 up, 84 in
                            flags sortbitwise
                      pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 
kobjects
                            8308 GB used, 297 TB / 305 TB avail
                                3136 active+clean
                 
                My ceph.conf :
                 
                [global]
                fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
                mon_initial_members = clb-cph-frpar2-mon-01, 
clb-cph-frpar1-mon-02, clb-cph-frpar2-mon-03
                mon_host = x.x.x.1,x.x.x.2,x.x.x.3
                auth_cluster_required = cephx
                auth_service_required = cephx
                auth_client_required = cephx
                filestore_xattr_use_omap = true
                public network = 10.25.25.0/24
                cluster network = 10.25.26.0/24
                debug_lockdep = 0/0
                debug_context = 0/0
                debug_crush = 0/0
                debug_buffer = 0/0
                debug_timer = 0/0
                debug_filer = 0/0
                debug_objecter = 0/0
                debug_rados = 0/0
                debug_rbd = 0/0
                debug_journaler = 0/0
                debug_objectcacher = 0/0
                debug_client = 0/0
                debug_osd = 0/0
                debug_optracker = 0/0
                debug_objclass = 0/0
                debug_filestore = 0/0
                debug_journal = 0/0
                debug_ms = 0/0
                debug_monc = 0/0
                debug_tp = 0/0
                debug_auth = 0/0
                debug_finisher = 0/0
                debug_heartbeatmap = 0/0
                debug_perfcounter = 0/0
                debug_asok = 0/0
                debug_throttle = 0/0
                debug_mon = 0/0
                debug_paxos = 0/0
                debug_rgw = 0/0
                 
                [osd]
                osd journal size = 0
                osd mount options xfs = 
"rw,noatime,inode64,logbufs=8,logbsize=256k"
                filestore min sync interval = 5
                filestore max sync interval = 15
                filestore queue max ops = 2048
                filestore queue max bytes = 1048576000
                filestore queue committing max ops = 4096
                filestore queue committing max bytes = 1048576000
                filestore op threads = 32
                filestore journal writeahead = true
                filestore merge threshold = 40
                filestore split multiple = 8
                 
                journal max write bytes = 1048576000
                journal max write entries = 4096
                journal queue max ops = 8092
                journal queue max bytes = 1048576000
                 
                osd max write size = 512
                osd op threads = 16
                osd disk threads = 2
                osd op num threads per shard = 3
                osd op num shards = 10
                osd map cache size = 1024
                osd max backfills = 1
                osd recovery max active = 2
                 
                I have set up two pools : a cache pool with 3x replication in 
front of an EC pool. At the moment I am only interested in the cache pool, so 
no promotions/flushes/evictions happen. 
                (I know I am using the same set of OSDs for hot and cold data, 
but in my use case they should not be used at the same time.)
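                (For context, a writeback cache tier like this is typically 
wired up with something along these lines. The EC pool name below is a 
placeholder; only "rbdcache" is taken from the rados bench command further 
down:)

                ceph osd tier add ecpool rbdcache
                ceph osd tier cache-mode rbdcache writeback
                ceph osd tier set-overlay ecpool rbdcache
                ceph osd pool set rbdcache hit_set_type bloom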
                 
                I am accessing the cluster via RBD volumes mapped with the 
kernel module on CentOS 7.1. These volumes are formatted in XFS on the clients.
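                (A minimal sketch of that access path, with an assumed image 
name and size, and assuming the image features are compatible with the 3.10 
kernel client:)

                rbd create rbdcache/bench-vol --size 102400
                rbd map rbdcache/bench-vol
                mkfs.xfs /dev/rbd0
                mount /dev/rbd0 /mnt/rbd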
                 
                The journal SSDs seem to perform quite well according to 
Sebastien Han’s suggested journal benchmark (they are SanDisk drives) :
                write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec 
(this is for numjobs=10)
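                (That suggestion is roughly the following fio invocation, run 
directly against the raw journal device; the device path is a placeholder:)

                fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
                    --numjobs=10 --iodepth=1 --runtime=60 --time_based \
                    --group_reporting --name=journal-test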
                 
                Here are the rados bench tests : 
                 
                rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup

                Total time run:         121.410763
                Total writes made:      65357
                Write size:             4096
                Bandwidth (MB/sec):     2.1
                Stddev Bandwidth:       0.597
                Max bandwidth (MB/sec): 3.89
                Min bandwidth (MB/sec): 0.00781
                Average IOPS:           538
                Stddev IOPS:            152
                Max IOPS:               995
                Min IOPS:               2
                Average Latency:        0.0594
                Stddev Latency:         0.18
                Max latency:            2.82
                Min latency:            0.00494
                 
                And the results of the fio test with the following parameters :
                 
                [global]
                size=8G
                runtime=300
                ioengine=libaio
                invalidate=1
                direct=1
                sync=1
                fsync=1
                numjobs=32
                rw=randwrite
                name=4k-32-1-randwrite-libaio
                blocksize=4K
                iodepth=1
                directory=/mnt/rbd
                group_reporting=1

                4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: 
pid=20442: Wed Dec  2 21:38:30 2015
                  write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
                    slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
                    clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
                     lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
                    clat percentiles (msec):
                     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 
20.00th=[    4],
                     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 
60.00th=[    5],
                     | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38], 
95.00th=[   63],
                     | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074], 
99.95th=[ 1221],
                     | 99.99th=[ 1532]
                    bw (KB  /s): min=    1, max=  448, per=3.64%, avg=123.48, 
stdev=102.09
                    lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, 
100=4.03%
                    lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
                  cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0, 
minf=928
                  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 
32=0.0%, >=64=0.0%
                     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, >=64=0.0%
                     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, >=64=0.0%
                     issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0, 
drop=r=0/w=0/d=0
                     latency   : target=0, window=0, percentile=100.00%, depth=1

                Run status group 0 (all jobs):
                  WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, 
maxb=3389KB/s, mint=300011msec, maxt=300011msec

                Disk stats (read/write):
                  rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847, 
in_queue=5677825, util=100.00%
                
                
                And a job closer to what the actual workload would be 
(blocksize=200K, numjobs=16, QD=32)

                200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0: 
pid=4828: Wed Dec  2 18:58:53 2015
                  write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
                    slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
                    clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
                     lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
                    clat percentiles (msec):
                     |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502], 
20.00th=[  537],
                     | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594], 
60.00th=[  603],
                     | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701], 
95.00th=[  881],
                     | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], 
99.95th=[ 2671],
                     | 99.99th=[ 2999]
                    bw (KB  /s): min=  260, max=18181, per=6.31%, avg=10189.40, 
stdev=2009.86
                    lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 
250=0.08%
                    lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
                  cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0, 
minf=433
                  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 
32=99.8%, >=64=0.0%
                     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, >=64=0.0%
                     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 
64=0.0%, >=64=0.0%
                     issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0, 
drop=r=0/w=0/d=0
                     latency   : target=0, window=0, percentile=100.00%, 
depth=32

                Run status group 0 (all jobs):
                  WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s, 
maxb=161367KB/s, mint=300189msec, maxt=300189msec

                Disk stats (read/write):
                  rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593, 
in_queue=5887504, util=99.91%


                The 4k block performance does not interest me so much but is 
given for reference. I am more interested in throughput, but either way, the 
numbers seem quite low.

                Let's take IOPS: assuming the spinners can each sustain 50 
synced 4k IOPS (I hope they can do more ^^), we should be around 50 x 84 / 3 = 
1400 IOPS, which is far from both rados bench (538) and fio (847). And 
surprisingly, the fio numbers are higher than the rados bench ones.

                So I don't know whether I am missing something here or if 
something is going wrong (maybe both!).

                Any input would be very valuable.

                Thank you,

                Adrien

