After some more tests:

- The pool being used as a cache pool has no impact on performance; I get the same results with a "dedicated" replicated pool.
- You are right Jan, on raw devices I get better performance on a volume if I fill it first, or at least if I write to a zone that has already been allocated (rough sequence sketched below).
- The same seems to apply when the test is run on the mounted filesystem.
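For reference, a rough sketch of the prefill-then-retest sequence on the raw device (the device name /dev/rbd1 and the fio options are taken from the tests quoted below; the exact invocation may have differed):

    # allocate the whole volume first, so the random writes hit already-provisioned objects
    dd if=/dev/zero of=/dev/rbd1 bs=1M

    # then re-run the 4K random write test directly against the raw device
    fio --name=4k-32-1-randwrite-libaio --filename=/dev/rbd1 \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --numjobs=32 --iodepth=1 --runtime=300 --group_reporting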
On Thu, Dec 3, 2015 at 2:49 PM, Adrien Gillard <gillard.adr...@gmail.com> wrote:

> I did some more tests :
>
> fio on a raw RBD volume (4K, numjob=32, QD=1) gives me around 3000 IOPS.
>
> I also tuned the xfs mount options on the client (I realized I hadn't done that already) and with
> "largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,auto,nodev,noatime,nodiratime"
> I get better performance :
>
> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=26793: Thu Dec  3 10:45:55 2015
>   write: io=1685.3MB, bw=5720.1KB/s, iops=1430, runt=301652msec
>     slat (usec): min=5, max=1620, avg=41.61, stdev=25.82
>     clat (msec): min=1, max=4141, avg=14.61, stdev=112.55
>      lat (msec): min=1, max=4141, avg=14.65, stdev=112.55
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
>      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>      | 70.00th=[    5], 80.00th=[    6], 90.00th=[    7], 95.00th=[    7],
>      | 99.00th=[  227], 99.50th=[  717], 99.90th=[ 1844], 99.95th=[ 2245],
>      | 99.99th=[ 3097]
>
> So, more than 50% improvement, but it actually varies quite a lot between tests (sometimes I get a bit more than 1000). If I run the test for 30 minutes it drops to 900 IOPS.
>
> As you suggested I also filled a volume with zeros (dd if=/dev/zero of=/dev/rbd1 bs=1M) and then ran fio on the raw device; I didn't see a lot of improvement.
>
> If I run the fio test directly on the block device I seem to saturate the spinners. [1] is a graph of the IO load on one of the OSD hosts.
> [2] is the same OSD graph, but when the test is done on a device mounted and formatted with XFS on the client.
> If I get half of the IOPS on the XFS volume because of the journal, shouldn't I get the same amount of IOPS on the backend ?
> [3] shows what happens if I run the test for 30 minutes.
>
> During the fio tests on the raw device, load average on the OSD servers increases up to 13/14 and I get a bit of iowait (I guess because the OSDs are busy).
> During the fio tests on the XFS-mounted device, load average on the OSD servers peaks at the beginning and decreases to 5/6, but goes through the roof on the client.
> The scheduler is deadline for all the drives; I didn't try to change it yet.
>
> What I don't understand, even with your explanations, are the rados results. From what I understand rados bench works at the RADOS level and thus should not be impacted by the client filesystem.
> Given the results above I guess you are right and this has to do with the client filesystem.
>
> The cluster will be used for backups, write IO size during backups is around 150/200K (I guess mostly sequential), and I am looking for the highest bandwidth and parallelization.
>
> @Nick, I will try to create a new stand-alone replicated pool.
>
> [1] http://postimg.org/image/qvtvdq1n1/
> [2] http://postimg.org/image/nhf6lzwgl/
> [3] http://postimg.org/image/h7l0obw7h/
>
> On Thu, Dec 3, 2015 at 1:30 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
>> Couple of things to check:
>>
>> 1. Can you create just a normal non-cached pool and test performance to rule out any funnies going on there.
>>
>> 2. Can you also run something like iostat during the benchmarks and see if it looks like all your disks are getting saturated.
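For anyone following along, a minimal sketch of these two checks; the pool name and PG count are placeholders, and the rados bench invocation mirrors the one used further down:

    # 1. plain replicated pool, no cache tiering, purely for benchmarking
    ceph osd pool create rbdtest 1024 1024 replicated
    ceph osd pool set rbdtest size 3
    rados bench -p rbdtest 120 write -b 4K -t 32 --no-cleanup

    # 2. extended per-device statistics on an OSD host while the benchmark runs
    iostat -xmt 2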
>>
>> _____________________________________________
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrien Gillard
>> Sent: 02 December 2015 21:33
>> To: ceph-us...@ceph.com
>> Subject: [ceph-users] New cluster performance analysis
>>
>> Hi everyone,
>>
>> I am currently testing our new cluster and I would like some feedback on the numbers I am getting.
>>
>> For the hardware :
>>
>> 7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64GB RAM, 2x10Gbits LACP for public net., 2x10Gbits LACP for cluster net., MTU 9000
>> 1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, 2x10Gbits LACP for public net., MTU 9000
>> 2 x MON : VMs (8 cores, 8GB RAM), backed by SSD
>>
>> Journals are 20GB partitions on SSD.
>>
>> The system is CentOS 7.1 with the stock kernel (3.10.0-229.20.1.el7.x86_64). No particular system optimizations.
>>
>> Ceph is Infernalis from the Ceph repository : ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>>
>> [cephadm@cph-adm-01 ~/scripts]$ ceph -s
>>     cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
>>      health HEALTH_OK
>>      monmap e1: 3 mons at {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
>>             election epoch 62, quorum 0,1,2 clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
>>      osdmap e844: 84 osds: 84 up, 84 in
>>             flags sortbitwise
>>       pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 kobjects
>>             8308 GB used, 297 TB / 305 TB avail
>>                 3136 active+clean
>>
>> My ceph.conf :
>>
>> [global]
>> fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
>> mon_initial_members = clb-cph-frpar2-mon-01, clb-cph-frpar1-mon-02, clb-cph-frpar2-mon-03
>> mon_host = x.x.x.1,x.x.x.2,x.x.x.3
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>> filestore_xattr_use_omap = true
>> public network = 10.25.25.0/24
>> cluster network = 10.25.26.0/24
>> debug_lockdep = 0/0
>> debug_context = 0/0
>> debug_crush = 0/0
>> debug_buffer = 0/0
>> debug_timer = 0/0
>> debug_filer = 0/0
>> debug_objecter = 0/0
>> debug_rados = 0/0
>> debug_rbd = 0/0
>> debug_journaler = 0/0
>> debug_objectcatcher = 0/0
>> debug_client = 0/0
>> debug_osd = 0/0
>> debug_optracker = 0/0
>> debug_objclass = 0/0
>> debug_filestore = 0/0
>> debug_journal = 0/0
>> debug_ms = 0/0
>> debug_monc = 0/0
>> debug_tp = 0/0
>> debug_auth = 0/0
>> debug_finisher = 0/0
>> debug_heartbeatmap = 0/0
>> debug_perfcounter = 0/0
>> debug_asok = 0/0
>> debug_throttle = 0/0
>> debug_mon = 0/0
>> debug_paxos = 0/0
>> debug_rgw = 0/0
>>
>> [osd]
>> osd journal size = 0
>> osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k"
>> filestore min sync interval = 5
>> filestore max sync interval = 15
>> filestore queue max ops = 2048
>> filestore queue max bytes = 1048576000
>> filestore queue committing max ops = 4096
>> filestore queue committing max bytes = 1048576000
>> filestore op thread = 32
>> filestore journal writeahead = true
>> filestore merge threshold = 40
>> filestore split multiple = 8
>>
>> journal max write bytes = 1048576000
>> journal max write entries = 4096
>> journal queue max ops = 8092
>> journal queue max bytes = 1048576000
>>
>> osd max write size = 512
>> osd op threads = 16
>> osd disk threads = 2
>> osd op num threads per shard = 3
>> osd op num shards = 10
>> osd map cache size = 1024
>> osd max backfills = 1
>> osd recovery max active = 2
>>
>> I have set up 2 pools : one for cache with 3x replication in front of an EC pool. At the moment I am only interested in the cache pool, so no promotions/flushes/evictions happen.
>> (I know, I am using the same set of OSDs for hot and cold data, but in my use case they should not be used at the same time.)
>>
>> I am accessing the cluster via RBD volumes mapped with the kernel module on CentOS 7.1. These volumes are formatted with XFS on the clients.
>>
>> The journal SSDs seem to perform quite well according to the results of Sebastien Han's benchmark suggestion (they are Sandisk) :
>>
>>   write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec  (this is for numjob=10)
>>
>> Here are the rados bench tests :
>>
>> rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup
>> Total time run:         121.410763
>> Total writes made:      65357
>> Write size:             4096
>> Bandwidth (MB/sec):     2.1
>> Stddev Bandwidth:       0.597
>> Max bandwidth (MB/sec): 3.89
>> Min bandwidth (MB/sec): 0.00781
>> Average IOPS:           538
>> Stddev IOPS:            152
>> Max IOPS:               995
>> Min IOPS:               2
>> Average Latency:        0.0594
>> Stddev Latency:         0.18
>> Max latency:            2.82
>> Min latency:            0.00494
>>
>> And the results of the fio test with the following parameters :
>>
>> [global]
>> size=8G
>> runtime=300
>> ioengine=libaio
>> invalidate=1
>> direct=1
>> sync=1
>> fsync=1
>> numjobs=32
>> rw=randwrite
>> name=4k-32-1-randwrite-libaio
>> blocksize=4K
>> iodepth=1
>> directory=/mnt/rbd
>> group_reporting=1
>>
>> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=20442: Wed Dec  2 21:38:30 2015
>>   write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
>>     slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
>>     clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
>>      lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
>>     clat percentiles (msec):
>>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
>>      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>>      | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38], 95.00th=[   63],
>>      | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074], 99.95th=[ 1221],
>>      | 99.99th=[ 1532]
>>     bw (KB /s): min=    1, max=  448, per=3.64%, avg=123.48, stdev=102.09
>>     lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, 100=4.03%
>>     lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
>>   cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0, minf=928
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, maxb=3389KB/s, mint=300011msec, maxt=300011msec
>>
>> Disk stats (read/write):
>>   rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847, in_queue=5677825, util=100.00%
>>
>> And a job closer to what the actual workload would be (blocksize=200K, numjob=16, QD=32) :
>>
>> 200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0: pid=4828: Wed Dec  2 18:58:53 2015
>>   write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
>>     slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
>>     clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
>>      lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
>>     clat percentiles (msec):
>>      |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502], 20.00th=[  537],
>>      | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594], 60.00th=[  603],
>>      | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701], 95.00th=[  881],
>>      | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], 99.95th=[ 2671],
>>      | 99.99th=[ 2999]
>>     bw (KB /s): min=  260, max=18181, per=6.31%, avg=10189.40, stdev=2009.86
>>     lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=0.08%
>>     lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
>>   cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0, minf=433
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s, maxb=161367KB/s, mint=300189msec, maxt=300189msec
>>
>> Disk stats (read/write):
>>   rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593, in_queue=5887504, util=99.91%
>>
>> The 4k block performance does not interest me so much but is given as a reference. I am more looking for throughput, but anyway, the numbers seem quite low.
>>
>> Let's take IOPS: assuming the spinners can do 50 (4k) synced sustained IOPS (I hope they can do more ^^), we should be around 50x84/3 = 1400 IOPS, which is far from both rados bench (538) and fio (847). And surprisingly the fio numbers are higher than the rados ones.
>>
>> So I don't know whether I am missing something here or if something is going wrong (maybe both !).
>>
>> Any input would be very valuable.
>>
>> Thank you,
>>
>> Adrien

--
-----------------------------------------------------------------------------------------
Adrien GILLARD
+33 (0)6 29 06 16 31
gillard.adr...@gmail.com