I did some more tests:

fio on a raw RBD volume (4K, numjobs=32, QD=1) gives me around 3000 IOPS.
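
For reference, the raw-device run was along these lines (a sketch: /dev/rbd1 is an example path, and this kind of run overwrites whatever is on the device):

fio --name=4k-32-1-randwrite-raw --filename=/dev/rbd1 --ioengine=libaio \
    --direct=1 --sync=1 --fsync=1 --rw=randwrite --blocksize=4K \
    --numjobs=32 --iodepth=1 --runtime=300 --time_based --group_reporting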

I also tuned the XFS mount options on the client (I realized I hadn't done
that yet) and with
"largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,auto,nodev,noatime,nodiratime"
I get better performance:

4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=26793: Thu Dec
 3 10:45:55 2015
  write: io=1685.3MB, bw=5720.1KB/s, iops=1430, runt=301652msec
    slat (usec): min=5, max=1620, avg=41.61, stdev=25.82
    clat (msec): min=1, max=4141, avg=14.61, stdev=112.55
     lat (msec): min=1, max=4141, avg=14.65, stdev=112.55
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    6], 90.00th=[    7], 95.00th=[    7],
     | 99.00th=[  227], 99.50th=[  717], 99.90th=[ 1844], 99.95th=[ 2245],
     | 99.99th=[ 3097]
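
For reference, the corresponding /etc/fstab entry on the client looks roughly like this (the device and mount point are examples, adjust as needed):

/dev/rbd0  /mnt/rbd  xfs  largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,auto,nodev,noatime,nodiratime  0  0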

So, more than a 50% improvement, but it varies quite a lot between tests
(sometimes I get only a bit more than 1000). If I run the test for 30
minutes it drops to around 900 IOPS.

As you suggested, I also filled a volume with zeros (dd if=/dev/zero
of=/dev/rbd1 bs=1M) and then ran fio on the raw device; I didn't see much
improvement.

If I run the fio test directly on the block device I seem to saturate the
spinners; [1] is a graph of the IO load on one of the OSD hosts.
[2] is the same OSD graph, but for a test run against a device formatted
with XFS and mounted on the client.
If I get half the IOPS on the XFS volume because of the journal, shouldn't
I still see the same amount of IOPS on the backend?
[3] shows what happens if I run the test for 30 minutes.

During the fio tests on the raw device, the load average on the OSD
servers increases up to 13/14 and I see a bit of iowait (I guess because
the OSDs are busy).
During the fio tests on the XFS-formatted volume, the load average on the
OSD servers peaks at the beginning and then decreases to 5/6, but it goes
through the roof on the client.
The scheduler is deadline for all the drives; I haven't tried changing it yet.
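
If I do test it, the idea would be something like this per drive (sdb is an example device name, and the change is not persistent across reboots):

cat /sys/block/sdb/queue/scheduler          # shows e.g. noop [deadline] cfq
echo noop > /sys/block/sdb/queue/scheduler  # switch to noop for a quick comparison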

What I don't understand, even with your explanations, are the rados bench
results. From what I understand, rados bench works at the RADOS level and
thus should not be impacted by the client filesystem. Given the results
above, I guess you are right and this has to do with the client filesystem.

The cluster will be used for backups; write IO size during backups is
around 150-200K (I guess mostly sequential) and I am looking for the
highest bandwidth and parallelization.

@Nick, I will try to create a new standalone replicated pool.
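
Probably something along these lines (pool name and PG count are only examples to be sized for our 84 OSDs):

ceph osd pool create testrep 1024 1024 replicated
ceph osd pool set testrep size 3
rados bench -p testrep 120 write -b 4K -t 32 --no-cleanup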


[1] http://postimg.org/image/qvtvdq1n1/
[2] http://postimg.org/image/nhf6lzwgl/
[3] http://postimg.org/image/h7l0obw7h/

On Thu, Dec 3, 2015 at 1:30 PM, Nick Fisk <n...@fisk.me.uk> wrote:

> Couple of things to check
>
> 1.      Can you create just a normal non cached pool and test performance
> to rule out any funnies going on there.
>
> 2.      Can you also run something like iostat during the benchmarks and
> see if it looks like all your disks are getting saturated.
>
>
>
>    _____________________________________________
>       From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrien Gillard
>       Sent: 02 December 2015 21:33
>       To: ceph-us...@ceph.com
>       Subject: [ceph-users] New cluster performance analysis
>
>       Hi everyone,
>
>
>
>       I am currently testing our new cluster and I would like some
>       feedback on the numbers I am getting.
>
>
>
>       For the hardware :
>
>       7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64GB RAM, 2x10Gbits LACP for
>       public net., 2x10Gbits LACP for cluster net., MTU 9000
>
>       1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD,
>       2x10Gbits LACP for public net., MTU 9000
>
>       2 x MON : VMs (8 cores, 8GB RAM), backed by SSD
>
>
>
>       Journals are 20GB partitions on SSD
>
>
>
>       The system is CentOS 7.1 with stock kernel
>       (3.10.0-229.20.1.el7.x86_64). No particular system optimizations.
>
>
>
>       Ceph is Infernalis from the Ceph repository: ceph version 9.2.0
>       (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>
>
>
>       [cephadm@cph-adm-01  ~/scripts]$ ceph -s
>
>           cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
>
>            health HEALTH_OK
>
>            monmap e1: 3 mons at
>       
> {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
>
>                   election epoch 62, quorum 0,1,2
>       clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
>
>            osdmap e844: 84 osds: 84 up, 84 in
>
>                   flags sortbitwise
>
>             pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 kobjects
>
>                   8308 GB used, 297 TB / 305 TB avail
>
>                       3136 active+clean
>
>
>
>       My ceph.conf :
>
>
>
>       [global]
>
>       fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
>
>       mon_initial_members = clb-cph-frpar2-mon-01, clb-cph-frpar1-mon-02,
>       clb-cph-frpar2-mon-03
>
>       mon_host = x.x.x.1,x.x.x.2,x.x.x.3
>
>       auth_cluster_required = cephx
>
>       auth_service_required = cephx
>
>       auth_client_required = cephx
>
>       filestore_xattr_use_omap = true
>
>       public network = 10.25.25.0/24
>
>       cluster network = 10.25.26.0/24
>
>       debug_lockdep = 0/0
>
>       debug_context = 0/0
>
>       debug_crush = 0/0
>
>       debug_buffer = 0/0
>
>       debug_timer = 0/0
>
>       debug_filer = 0/0
>
>       debug_objecter = 0/0
>
>       debug_rados = 0/0
>
>       debug_rbd = 0/0
>
>       debug_journaler = 0/0
>
>       debug_objectcatcher = 0/0
>
>       debug_client = 0/0
>
>       debug_osd = 0/0
>
>       debug_optracker = 0/0
>
>       debug_objclass = 0/0
>
>       debug_filestore = 0/0
>
>       debug_journal = 0/0
>
>       debug_ms = 0/0
>
>       debug_monc = 0/0
>
>       debug_tp = 0/0
>
>       debug_auth = 0/0
>
>       debug_finisher = 0/0
>
>       debug_heartbeatmap = 0/0
>
>       debug_perfcounter = 0/0
>
>       debug_asok = 0/0
>
>       debug_throttle = 0/0
>
>       debug_mon = 0/0
>
>       debug_paxos = 0/0
>
>       debug_rgw = 0/0
>
>
>
>       [osd]
>
>       osd journal size = 0
>
>       osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k"
>
>       filestore min sync interval = 5
>
>       filestore max sync interval = 15
>
>       filestore queue max ops = 2048
>
>       filestore queue max bytes = 1048576000
>
>       filestore queue committing max ops = 4096
>
>       filestore queue committing max bytes = 1048576000
>
>       filestore op thread = 32
>
>       filestore journal writeahead = true
>
>       filestore merge threshold = 40
>
>       filestore split multiple = 8
>
>
>
>       journal max write bytes = 1048576000
>
>       journal max write entries = 4096
>
>       journal queue max ops = 8092
>
>       journal queue max bytes = 1048576000
>
>
>
>       osd max write size = 512
>
>       osd op threads = 16
>
>       osd disk threads = 2
>
>       osd op num threads per shard = 3
>
>       osd op num shards = 10
>
>       osd map cache size = 1024
>
>       osd max backfills = 1
>
>       osd recovery max active = 2
>
>
>
>       I have set up 2 pools : one for cache with 3x replication in front
>       of an EC pool. At the moment I am only interested in the cache pool,
>       so no promotions/flushes/evictions happen.
>
>       (I know, I am using the same set of OSDs for hot and cold data, but
>       in my use case they should not be used at the same time.)
>
>
>
>       I am accessing the cluster via RBD volumes mapped with the kernel
>       module on CentOS 7.1. These volumes are formatted in XFS on the clients.
>
>
>
>       The journal SSDs seem to perform quite well according to the
>       results of Sebastien Han’s benchmark suggestion (they are Sandisk) :
>
>       write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec (this
>       is for numjob=10)
>
>
>
>       Here are the rados bench tests :
>
>
>
>       rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup
>
>       Total time run:         121.410763
>
>       Total writes made:      65357
>
>       Write size:             4096
>
>       Bandwidth (MB/sec):     2.1
>
>       Stddev Bandwidth:       0.597
>
>       Max bandwidth (MB/sec): 3.89
>
>       Min bandwidth (MB/sec): 0.00781
>
>       Average IOPS:           538
>
>       Stddev IOPS:            152
>
>       Max IOPS:               995
>
>       Min IOPS:               2
>
>       Average Latency:        0.0594
>
>       Stddev Latency:         0.18
>
>       Max latency:            2.82
>
>       Min latency:            0.00494
>
>
>
>       And the results of the fio test with the following parameters :
>
>
>
>       [global]
>
>       size=8G
>
>       runtime=300
>
>       ioengine=libaio
>
>       invalidate=1
>
>       direct=1
>
>       sync=1
>
>       fsync=1
>
>       numjobs=32
>
>       rw=randwrite
>
>       name=4k-32-1-randwrite-libaio
>
>       blocksize=4K
>
>       iodepth=1
>
>       directory=/mnt/rbd
>
>       group_reporting=1
>
>       4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=20442:
>       Wed Dec  2 21:38:30 2015
>
>         write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
>
>           slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
>
>           clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
>
>            lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
>
>           clat percentiles (msec):
>
>            |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
>
>            | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>
>            | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38], 95.00th=[   63],
>
>            | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074], 99.95th=[ 1221],
>
>            | 99.99th=[ 1532]
>
>           bw (KB  /s): min=    1, max=  448, per=3.64%, avg=123.48,
>       stdev=102.09
>
>           lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, 100=4.03%
>
>           lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
>
>         cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0, minf=928
>
>         IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%,
>       32=0.0%, >=64=0.0%
>
>            submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>       64=0.0%, >=64=0.0%
>
>            complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>       64=0.0%, >=64=0.0%
>
>            issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0,
>       drop=r=0/w=0/d=0
>
>            latency   : target=0, window=0, percentile=100.00%, depth=1
>
>       Run status group 0 (all jobs):
>
>         WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, maxb=3389KB/s,
>       mint=300011msec, maxt=300011msec
>
>       Disk stats (read/write):
>
>         rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847,
>       in_queue=5677825, util=100.00%
>
>       And a job closer to what the actual workload would be
>       (blocksize=200K, numjob=16, QD=32)
>
>       200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0:
>       pid=4828: Wed Dec  2 18:58:53 2015
>
>         write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
>
>           slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
>
>           clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
>
>            lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
>
>           clat percentiles (msec):
>
>            |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502], 20.00th=[  537],
>
>            | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594], 60.00th=[  603],
>
>            | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701], 95.00th=[  881],
>
>            | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], 99.95th=[ 2671],
>
>            | 99.99th=[ 2999]
>
>           bw (KB  /s): min=  260, max=18181, per=6.31%, avg=10189.40,
>       stdev=2009.86
>
>           lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=0.08%
>
>           lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
>
>         cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0, minf=433
>
>         IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%,
>       >=64=0.0%
>
>            submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>       64=0.0%, >=64=0.0%
>
>            complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%,
>       64=0.0%, >=64=0.0%
>
>            issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0,
>       drop=r=0/w=0/d=0
>
>            latency   : target=0, window=0, percentile=100.00%, depth=32
>
>       Run status group 0 (all jobs):
>
>         WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s,
>       maxb=161367KB/s, mint=300189msec, maxt=300189msec
>
>       Disk stats (read/write):
>
>         rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593,
>       in_queue=5887504, util=99.91%
>
>       The 4k block performance does not interest me so much but is given
>       as a reference. I am more looking for throughput, but anyway, the
>       numbers seem quite low.
>
>       Let's take IOPS: assuming the spinners can do 50 synced sustained
>       4k IOPS (I hope they can do more ^^), we should be around 50x84/3 =
>       1400 IOPS, which is far from rados bench (538) and fio (847). And,
>       surprisingly, the fio numbers are greater than the rados ones.
>
>       So I don't know whether I am missing something here or if something
>       is going wrong (maybe both!).
>
>       Any input would be very valuable.
>
>       Thank you,
>
>       Adrien
>
>
>


-- 
-----------------------------------------------------------------------------------------
Adrien GILLARD

+33 (0)6 29 06 16 31
gillard.adr...@gmail.com
