Yes Mark, all of my changes are in Ceph master now, and we are seeing a significant random-read (RR) performance improvement with them.
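In case it helps anyone trying these options out, the quickest way to confirm they are actually applied on a running OSD is the admin socket. A rough sketch (osd.0 is just an example id, adjust it for your setup):

# query the new sharding/op-tracker options on osd.0 via the admin socket
ceph daemon osd.0 config get osd_op_num_shards
ceph daemon osd.0 config get osd_op_num_threads_per_shard
ceph daemon osd.0 config get osd_enable_op_tracker

The sharded worker pool is set up when the OSD starts, so after changing these values in ceph.conf the OSD has to be restarted for them to take effect.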
Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: Thursday, August 28, 2014 10:43 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS

On 08/28/2014 12:39 PM, Somnath Roy wrote:
> Hi Sebastien,
> If you are trying with the latest Ceph master, there are some changes we made that will increase your read performance from an SSD by a factor of ~5X when the IOs actually hit the disk; when serving from memory, the improvement is even greater. The single OSD will eventually become CPU-bound as the number of clients increases, in both the read-from-disk and read-from-memory scenarios. Some new config options have been introduced; here they are:
>
> osd_op_num_threads_per_shard
> osd_op_num_shards
> throttler_perf_counter
> osd_enable_op_tracker
> filestore_fd_cache_size
> filestore_fd_cache_shards
>
> The worker pool for the IO path is now sharded, and the options above control this. osd_op_threads is no longer in the IO path. Also, the filestore FD cache is now sharded.
> In my setup (64GB RAM, 40-core CPU with HT enabled), the following config file on a single OSD gives the optimum result for 4K RR reads.
>
> [global]
>
> filestore_xattr_use_omap = true
>
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcacher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
> osd_op_threads = 5
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> #osd_op_num_sharded_pool_threads = 25
> filestore_op_threads = 4
>
> ms_nocrc = true
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
> cephx sign messages = false
> cephx require signatures = false
>
> ms_dispatch_throttle_bytes = 0
> throttler_perf_counter = false
>
>
> [osd]
> osd_client_message_size_cap = 0
> osd_client_message_cap = 0
> osd_enable_op_tracker = false
>
>
> What I saw is that the op tracker is one of the major bottlenecks, and we are in the process of optimizing it. For now, code to enable/disable the op tracker has been introduced. Also, several bottlenecks at the filestore level have been removed.
> Unfortunately, we have yet to optimize the write path. All of this should help the write path as well, but the write-path improvement will not be visible until all the lock serialization is removed.

This is what I'm waiting for. :) I've been meaning to ask you, Somnath: how goes progress?

Mark

>
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han
> Sent: Thursday, August 28, 2014 9:12 AM
> To: ceph-users
> Cc: Mark Nelson
> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
> Hey all,
>
> It has been a while since the last performance-related thread on the ML :p
> I've been running some experiments to see how much I can get out of an SSD on a Ceph cluster.
> To achieve that I did something pretty simple:
>
> * Debian wheezy 7.6
> * kernel from debian 3.14-0.bpo.2-amd64
> * 1 cluster, 3 mons (I'd like to keep this realistic since in a real deployment I'll use 3)
> * 1 OSD backed by an SSD (journal and osd data on the same device)
> * replica count of 1
> * partitions are perfectly aligned
> * io scheduler is set to noop, but deadline showed the same results
> * no updatedb running
>
> About the box:
>
> * 32GB of RAM
> * 12 cores with HT @ 2.4 GHz
> * WB cache is enabled on the controller
> * 10Gbps network (doesn't help here)
>
> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K IOPS with random 4K writes (my fio results). As a benchmark tool I used fio with the rbd engine (thanks Deutsche Telekom guys!).
>
> O_DIRECT and D_SYNC don't seem to be a problem for the SSD:
>
> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>
> # du -sh rand.file
> 256M rand.file
>
> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>
> See my ceph.conf:
>
> [global]
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
> osd pool default pg num = 4096
> osd pool default pgp num = 4096
> osd pool default size = 2
> osd crush chooseleaf type = 0
>
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
>
> [mon]
> mon osd down out interval = 600
> mon osd min down reporters = 13
> [mon.ceph-01]
> host = ceph-01
> mon addr = 172.20.20.171
> [mon.ceph-02]
> host = ceph-02
> mon addr = 172.20.20.172
> [mon.ceph-03]
> host = ceph-03
> mon addr = 172.20.20.173
>
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
>
> [osd]
> osd mkfs type = xfs
> osd mkfs options xfs = -f -i size=2048
> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
> osd journal size = 20480
> cluster_network = 172.20.20.0/24
> public_network = 172.20.20.0/24
> osd mon heartbeat interval = 30
> # Performance tuning
> filestore merge threshold = 40
> filestore split multiple = 8
> osd op threads = 8
> # Recovery tuning
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery op priority = 1
>
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
>
> Disabling all debugging made me win 200-300 more IOPS.
>
> See my fio template:
>
> [global]
> #logging
> #write_iops_log=write_iops_log
> #write_bw_log=write_bw_log
> #write_lat_log=write_lat_lo
>
> time_based
> runtime=60
>
> ioengine=rbd
> clientname=admin
> pool=test
> rbdname=fio
> invalidate=0 # mandatory
> #rw=randwrite
> rw=write
> bs=4k
> #bs=32m
> size=5G
> group_reporting
>
> [rbd_iodepth32]
> iodepth=32
> direct=1
>
> See my fio output:
>
> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
> fio-2.1.11-14-gb74e
> Starting 1 process
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>     clat percentiles (usec):
>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>      | 99.99th=[28032]
>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>   cpu          : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, mint=60010msec, maxt=60010msec
>
> Disk stats (read/write):
>     dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>
> I tried to tweak several parameters like:
>
> filestore_wbthrottle_xfs_ios_start_flusher = 10000
> filestore_wbthrottle_xfs_ios_hard_limit = 10000
> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
> filestore queue max ops = 2000
>
> But didn't see any improvement.
>
> Then I tried other things:
>
> * Increasing the io_depth up to 256 or 512 gave me between 50 and 100 more IOPS, but that's not a realistic workload anymore and not that significant.
> * Adding another SSD for the journal: still getting 3,2K IOPS.
> * I tried with rbd bench and also got 3K IOPS.
> * I ran the test on a client machine and then locally on the server: still getting 3,2K IOPS.
> * Putting the journal in memory: still getting 3,2K IOPS.
> * With 2 clients running the test in parallel I got a total of 3,6K IOPS, but I don't seem to be able to go over that.
> * I tried to add another OSD on that SSD, so I had 2 OSDs and 2 journals on 1 SSD: got 4,5K IOPS, YAY!
>
> Given these results, it seems that something is limiting the number of IOPS per OSD process.
>
> Running the test on a client or locally didn't show any difference.
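A quick note for anyone wanting to reproduce the fio run above: the rbd engine does not create the image for you, so the pool and image named in the job file have to exist beforehand. Roughly (the pg counts mirror the pool defaults from the ceph.conf above, and the job-file name is just an example):

# create the pool and a 5GB image matching the template (pool=test, rbdname=fio, size=5G)
ceph osd pool create test 4096 4096
rbd create fio --pool test --size 5120

# run the job file (fio must be built with rbd engine support)
fio rbd-4k-write.fio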
> So it looks to me that there is some contention within Ceph that might be causing this.
>
> I also ran perf and looked at the output; everything looks decent, but someone might want to have a look at it :).
>
> We have been able to reproduce this on 3 distinct platforms with some deviations (because of the hardware), but the behaviour is the same.
> Any thoughts would be highly appreciated; only getting 3,2K out of a 29K IOPS SSD is a bit frustrating :).
>
> Cheers.
> ----
> Sébastien Han
> Cloud Architect
>
> "Always give 100%. Unless you're giving blood."
>
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien....@enovance.com
> Address: 11 bis, rue Roquépine - 75008 Paris
> Web: www.enovance.com - Twitter: @enovance

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com