After some more tests:

- The pool being used as a cache pool has no impact on performance; I get the same results with a "dedicated" replicated pool.
- You are right Jan, on raw devices I get better performance on a volume if I fill it first, or at least if I write to a zone that has already been allocated (rough sequence sketched below).
- The same seems to apply when the test is run on the mounted filesystem.
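For reference, a rough sketch of the prefill-then-retest sequence on the raw device (the device name /dev/rbd1 and the fio options are taken from the tests quoted below; the exact invocation may have differed):

    # allocate the whole volume first, so the random writes hit already-provisioned objects
    dd if=/dev/zero of=/dev/rbd1 bs=1M

    # then re-run the 4K random write test directly against the raw device
    fio --name=4k-32-1-randwrite-libaio --filename=/dev/rbd1 \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --numjobs=32 --iodepth=1 --runtime=300 --group_reporting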
On Thu, Dec 3, 2015 at 2:49 PM, Adrien Gillard <gillard.adr...@gmail.com> wrote:

> I did some more tests :
>
> fio on a raw RBD volume (4K, numjob=32, QD=1) gives me around 3000 IOPS.
>
> I also tuned the xfs mount options on the client (I realized I hadn't done that already) and with
> "largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,auto,nodev,noatime,nodiratime"
> I get better performance :
>
> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=26793: Thu Dec  3 10:45:55 2015
>   write: io=1685.3MB, bw=5720.1KB/s, iops=1430, runt=301652msec
>     slat (usec): min=5, max=1620, avg=41.61, stdev=25.82
>     clat (msec): min=1, max=4141, avg=14.61, stdev=112.55
>      lat (msec): min=1, max=4141, avg=14.65, stdev=112.55
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
>      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>      | 70.00th=[    5], 80.00th=[    6], 90.00th=[    7], 95.00th=[    7],
>      | 99.00th=[  227], 99.50th=[  717], 99.90th=[ 1844], 99.95th=[ 2245],
>      | 99.99th=[ 3097]
>
> So, more than 50% improvement, but it actually varies quite a lot between tests (sometimes I get a bit more than 1000). If I run the test for 30 minutes it drops to 900 IOPS.
>
> As you suggested I also filled a volume with zeros (dd if=/dev/zero of=/dev/rbd1 bs=1M) and then ran fio on the raw device; I didn't see a lot of improvement.
>
> If I run the fio test directly on the block device I seem to saturate the spinners. [1] is a graph of the IO load on one of the OSD hosts.
> [2] is the same OSD graph, but when the test is done on a device mounted and formatted with XFS on the client.
> If I get half of the IOPS on the XFS volume because of the journal, shouldn't I get the same amount of IOPS on the backend ?
> [3] shows what happens if I run the test for 30 minutes.
>
> During the fio tests on the raw device, load average on the OSD servers increases up to 13/14 and I get a bit of iowait (I guess because the OSDs are busy).
> During the fio tests on the XFS-mounted device, load average on the OSD servers peaks at the beginning and decreases to 5/6, but goes through the roof on the client.
> The scheduler is deadline for all the drives; I didn't try to change it yet.
>
> What I don't understand, even with your explanations, are the rados results. From what I understand rados bench works at the RADOS level and thus should not be impacted by the client filesystem.
> Given the results above I guess you are right and this has to do with the client filesystem.
>
> The cluster will be used for backups, write IO size during backups is around 150/200K (I guess mostly sequential), and I am looking for the highest bandwidth and parallelization.
>
> @Nick, I will try to create a new stand-alone replicated pool.
>
> [1] http://postimg.org/image/qvtvdq1n1/
> [2] http://postimg.org/image/nhf6lzwgl/
> [3] http://postimg.org/image/h7l0obw7h/
>
> On Thu, Dec 3, 2015 at 1:30 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
>> Couple of things to check:
>>
>> 1. Can you create just a normal non-cached pool and test performance to rule out any funnies going on there.
>>
>> 2. Can you also run something like iostat during the benchmarks and see if it looks like all your disks are getting saturated.
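For anyone following along, a minimal sketch of these two checks; the pool name and PG count are placeholders, and the rados bench invocation mirrors the one used further down:

    # 1. plain replicated pool, no cache tiering, purely for benchmarking
    ceph osd pool create rbdtest 1024 1024 replicated
    ceph osd pool set rbdtest size 3
    rados bench -p rbdtest 120 write -b 4K -t 32 --no-cleanup

    # 2. extended per-device statistics on an OSD host while the benchmark runs
    iostat -xmt 2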
>>
>> _____________________________________________
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrien Gillard
>> Sent: 02 December 2015 21:33
>> To: ceph-us...@ceph.com
>> Subject: [ceph-users] New cluster performance analysis
>>
>> Hi everyone,
>>
>> I am currently testing our new cluster and I would like some feedback on the numbers I am getting.
>>
>> For the hardware :
>>
>> 7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64GB RAM, 2x10Gbits LACP for public net., 2x10Gbits LACP for cluster net., MTU 9000
>> 1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, 2x10Gbits LACP for public net., MTU 9000
>> 2 x MON : VMs (8 cores, 8GB RAM), backed by SSD
>>
>> Journals are 20GB partitions on SSD.
>>
>> The system is CentOS 7.1 with the stock kernel (3.10.0-229.20.1.el7.x86_64). No particular system optimizations.
>>
>> Ceph is Infernalis from the Ceph repository : ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>>
>> [cephadm@cph-adm-01 ~/scripts]$ ceph -s
>>     cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
>>      health HEALTH_OK
>>      monmap e1: 3 mons at {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
>>             election epoch 62, quorum 0,1,2 clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
>>      osdmap e844: 84 osds: 84 up, 84 in
>>             flags sortbitwise
>>       pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 kobjects
>>             8308 GB used, 297 TB / 305 TB avail
>>                 3136 active+clean
>>
>> My ceph.conf :
>>
>> [global]
>> fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
>> mon_initial_members = clb-cph-frpar2-mon-01, clb-cph-frpar1-mon-02, clb-cph-frpar2-mon-03
>> mon_host = x.x.x.1,x.x.x.2,x.x.x.3
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>> filestore_xattr_use_omap = true
>> public network = 10.25.25.0/24
>> cluster network = 10.25.26.0/24
>> debug_lockdep = 0/0
>> debug_context = 0/0
>> debug_crush = 0/0
>> debug_buffer = 0/0
>> debug_timer = 0/0
>> debug_filer = 0/0
>> debug_objecter = 0/0
>> debug_rados = 0/0
>> debug_rbd = 0/0
>> debug_journaler = 0/0
>> debug_objectcatcher = 0/0
>> debug_client = 0/0
>> debug_osd = 0/0
>> debug_optracker = 0/0
>> debug_objclass = 0/0
>> debug_filestore = 0/0
>> debug_journal = 0/0
>> debug_ms = 0/0
>> debug_monc = 0/0
>> debug_tp = 0/0
>> debug_auth = 0/0
>> debug_finisher = 0/0
>> debug_heartbeatmap = 0/0
>> debug_perfcounter = 0/0
>> debug_asok = 0/0
>> debug_throttle = 0/0
>> debug_mon = 0/0
>> debug_paxos = 0/0
>> debug_rgw = 0/0
>>
>> [osd]
>> osd journal size = 0
>> osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k"
>> filestore min sync interval = 5
>> filestore max sync interval = 15
>> filestore queue max ops = 2048
>> filestore queue max bytes = 1048576000
>> filestore queue committing max ops = 4096
>> filestore queue committing max bytes = 1048576000
>> filestore op thread = 32
>> filestore journal writeahead = true
>> filestore merge threshold = 40
>> filestore split multiple = 8
>>
>> journal max write bytes = 1048576000
>> journal max write entries = 4096
>> journal queue max ops = 8092
>> journal queue max bytes = 1048576000
>>
>> osd max write size = 512
>> osd op threads = 16
>> osd disk threads = 2
>> osd op num threads per shard = 3
>> osd op num shards = 10
>> osd map cache size = 1024
>> osd max backfills = 1
>> osd recovery max active = 2
>>
>> I have set up 2 pools : one for cache with 3x replication in front of an EC pool. At the moment I am only interested in the cache pool, so no promotions/flushes/evictions happen.
>> (I know, I am using the same set of OSDs for hot and cold data, but in my use case they should not be used at the same time.)
>>
>> I am accessing the cluster via RBD volumes mapped with the kernel module on CentOS 7.1. These volumes are formatted with XFS on the clients.
>>
>> The journal SSDs seem to perform quite well according to the results of Sebastien Han's benchmark suggestion (they are Sandisk) :
>>
>>   write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec  (this is for numjob=10)
>>
>> Here are the rados bench tests :
>>
>> rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup
>> Total time run:         121.410763
>> Total writes made:      65357
>> Write size:             4096
>> Bandwidth (MB/sec):     2.1
>> Stddev Bandwidth:       0.597
>> Max bandwidth (MB/sec): 3.89
>> Min bandwidth (MB/sec): 0.00781
>> Average IOPS:           538
>> Stddev IOPS:            152
>> Max IOPS:               995
>> Min IOPS:               2
>> Average Latency:        0.0594
>> Stddev Latency:         0.18
>> Max latency:            2.82
>> Min latency:            0.00494
>>
>> And the results of the fio test with the following parameters :
>>
>> [global]
>> size=8G
>> runtime=300
>> ioengine=libaio
>> invalidate=1
>> direct=1
>> sync=1
>> fsync=1
>> numjobs=32
>> rw=randwrite
>> name=4k-32-1-randwrite-libaio
>> blocksize=4K
>> iodepth=1
>> directory=/mnt/rbd
>> group_reporting=1
>>
>> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=20442: Wed Dec  2 21:38:30 2015
>>   write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
>>     slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
>>     clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
>>      lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
>>     clat percentiles (msec):
>>      |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
>>      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
>>      | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38], 95.00th=[   63],
>>      | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074], 99.95th=[ 1221],
>>      | 99.99th=[ 1532]
>>     bw (KB /s): min=    1, max=  448, per=3.64%, avg=123.48, stdev=102.09
>>     lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, 100=4.03%
>>     lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
>>   cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0, minf=928
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=1
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, maxb=3389KB/s, mint=300011msec, maxt=300011msec
>>
>> Disk stats (read/write):
>>   rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847, in_queue=5677825, util=100.00%
>>
>> And a job closer to what the actual workload would be (blocksize=200K, numjob=16, QD=32) :
>>
>> 200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0: pid=4828: Wed Dec  2 18:58:53 2015
>>   write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
>>     slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
>>     clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
>>      lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
>>     clat percentiles (msec):
>>      |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502], 20.00th=[  537],
>>      | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594], 60.00th=[  603],
>>      | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701], 95.00th=[  881],
>>      | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], 99.95th=[ 2671],
>>      | 99.99th=[ 2999]
>>     bw (KB /s): min=  260, max=18181, per=6.31%, avg=10189.40, stdev=2009.86
>>     lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=0.08%
>>     lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
>>   cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0, minf=433
>>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>>      latency   : target=0, window=0, percentile=100.00%, depth=32
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s, maxb=161367KB/s, mint=300189msec, maxt=300189msec
>>
>> Disk stats (read/write):
>>   rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593, in_queue=5887504, util=99.91%
>>
>> The 4k block performance does not interest me so much but is given as a reference. I am more looking for throughput, but anyway, the numbers seem quite low.
>>
>> Let's take IOPS: assuming the spinners can do 50 (4k) synced sustained IOPS (I hope they can do more ^^), we should be around 50x84/3 = 1400 IOPS, which is far from both rados bench (538) and fio (847). And surprisingly the fio numbers are higher than the rados ones.
>>
>> So I don't know whether I am missing something here or if something is going wrong (maybe both !).
>>
>> Any input would be very valuable.
>>
>> Thank you,
>>
>> Adrien

--
-----------------------------------------------------------------------------------------
Adrien GILLARD
+33 (0)6 29 06 16 31
gillard.adr...@gmail.com