Yes Mark, all of my changes are in Ceph master now, and we are seeing a significant random-read (RR) performance improvement with them.
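In case it helps anyone trying these options out, the quickest way to confirm they are actually applied on a running OSD is the admin socket. A rough sketch (osd.0 is just an example id, adjust it for your setup):

# query the new sharding/op-tracker options on osd.0 via the admin socket
ceph daemon osd.0 config get osd_op_num_shards
ceph daemon osd.0 config get osd_op_num_threads_per_shard
ceph daemon osd.0 config get osd_enable_op_tracker

The sharded worker pool is set up when the OSD starts, so after changing these values in ceph.conf the OSD has to be restarted for them to take effect.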
Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: Thursday, August 28, 2014 10:43 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS

On 08/28/2014 12:39 PM, Somnath Roy wrote:
> Hi Sebastien,
> If you are trying with the latest Ceph master, there are some changes we made that will increase your read performance from an SSD by a factor of ~5X when the IOs actually hit the disk; when serving from memory, the improvement is even greater. The single OSD will eventually become CPU-bound as the number of clients increases, in both the read-from-disk and read-from-memory scenarios. Some new config options have been introduced; here they are:
>
> osd_op_num_threads_per_shard
> osd_op_num_shards
> throttler_perf_counter
> osd_enable_op_tracker
> filestore_fd_cache_size
> filestore_fd_cache_shards
>
> The worker pool for the IO path is now sharded, and the options above control this. osd_op_threads is no longer in the IO path. Also, the filestore FD cache is now sharded.
> In my setup (64GB RAM, 40-core CPU with HT enabled), the following config file on a single OSD gives the optimum result for 4K RR reads.
>
> [global]
>
> filestore_xattr_use_omap = true
>
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcacher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
> osd_op_threads = 5
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> #osd_op_num_sharded_pool_threads = 25
> filestore_op_threads = 4
>
> ms_nocrc = true
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
> cephx sign messages = false
> cephx require signatures = false
>
> ms_dispatch_throttle_bytes = 0
> throttler_perf_counter = false
>
>
> [osd]
> osd_client_message_size_cap = 0
> osd_client_message_cap = 0
> osd_enable_op_tracker = false
>
>
> What I saw is that the op tracker is one of the major bottlenecks, and we are in the process of optimizing it. For now, code to enable/disable the op tracker has been introduced. Also, several bottlenecks at the filestore level have been removed.
> Unfortunately, we have yet to optimize the write path. All of this should help the write path as well, but the write-path improvement will not be visible until all the lock serialization is removed.

This is what I'm waiting for. :) I've been meaning to ask you, Somnath: how goes progress?

Mark

>
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han
> Sent: Thursday, August 28, 2014 9:12 AM
> To: ceph-users
> Cc: Mark Nelson
> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
> Hey all,
>
> It has been a while since the last performance-related thread on the ML :p
> I've been running some experiments to see how much I can get out of an SSD on a Ceph cluster.
> To achieve that I did something pretty simple:
>
> * Debian wheezy 7.6
> * kernel from debian 3.14-0.bpo.2-amd64
> * 1 cluster, 3 mons (I'd like to keep this realistic since in a real deployment I'll use 3)
> * 1 OSD backed by an SSD (journal and osd data on the same device)
> * replica count of 1
> * partitions are perfectly aligned
> * io scheduler is set to noop, but deadline showed the same results
> * no updatedb running
>
> About the box:
>
> * 32GB of RAM
> * 12 cores with HT @ 2.4 GHz
> * WB cache is enabled on the controller
> * 10Gbps network (doesn't help here)
>
> The SSD is a 200G Intel DC S3700 and is capable of delivering around 29K IOPS with random 4K writes (my fio results). As a benchmark tool I used fio with the rbd engine (thanks Deutsche Telekom guys!).
>
> O_DIRECT and D_SYNC don't seem to be a problem for the SSD:
>
> # dd if=/dev/urandom of=rand.file bs=4k count=65536
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>
> # du -sh rand.file
> 256M rand.file
>
> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
> 65536+0 records in
> 65536+0 records out
> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>
> See my ceph.conf:
>
> [global]
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
> osd pool default pg num = 4096
> osd pool default pgp num = 4096
> osd pool default size = 2
> osd crush chooseleaf type = 0
>
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
>
> [mon]
> mon osd down out interval = 600
> mon osd min down reporters = 13
> [mon.ceph-01]
> host = ceph-01
> mon addr = 172.20.20.171
> [mon.ceph-02]
> host = ceph-02
> mon addr = 172.20.20.172
> [mon.ceph-03]
> host = ceph-03
> mon addr = 172.20.20.173
>
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
>
> [osd]
> osd mkfs type = xfs
> osd mkfs options xfs = -f -i size=2048
> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
> osd journal size = 20480
> cluster_network = 172.20.20.0/24
> public_network = 172.20.20.0/24
> osd mon heartbeat interval = 30
> # Performance tuning
> filestore merge threshold = 40
> filestore split multiple = 8
> osd op threads = 8
> # Recovery tuning
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery op priority = 1
>
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
>
> Disabling all debugging made me win 200-300 more IOPS.
>
> See my fio template:
>
> [global]
> #logging
> #write_iops_log=write_iops_log
> #write_bw_log=write_bw_log
> #write_lat_log=write_lat_lo
>
> time_based
> runtime=60
>
> ioengine=rbd
> clientname=admin
> pool=test
> rbdname=fio
> invalidate=0 # mandatory
> #rw=randwrite
> rw=write
> bs=4k
> #bs=32m
> size=5G
> group_reporting
>
> [rbd_iodepth32]
> iodepth=32
> direct=1
>
> See my fio output:
>
> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
> fio-2.1.11-14-gb74e
> Starting 1 process
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>     clat percentiles (usec):
>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>      | 99.99th=[28032]
>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>   cpu          : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, mint=60010msec, maxt=60010msec
>
> Disk stats (read/write):
>     dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>
> I tried to tweak several parameters like:
>
> filestore_wbthrottle_xfs_ios_start_flusher = 10000
> filestore_wbthrottle_xfs_ios_hard_limit = 10000
> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
> filestore queue max ops = 2000
>
> But didn't see any improvement.
>
> Then I tried other things:
>
> * Increasing the io_depth up to 256 or 512 gave me between 50 and 100 more IOPS, but that's not a realistic workload anymore and not that significant.
> * Adding another SSD for the journal: still getting 3,2K IOPS.
> * I tried with rbd bench and also got 3K IOPS.
> * I ran the test on a client machine and then locally on the server: still getting 3,2K IOPS.
> * Putting the journal in memory: still getting 3,2K IOPS.
> * With 2 clients running the test in parallel I got a total of 3,6K IOPS, but I don't seem to be able to go over that.
> * I tried to add another OSD on that SSD, so I had 2 OSDs and 2 journals on 1 SSD: got 4,5K IOPS, YAY!
>
> Given these results, it seems that something is limiting the number of IOPS per OSD process.
>
> Running the test on a client or locally didn't show any difference.
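A quick note for anyone wanting to reproduce the fio run above: the rbd engine does not create the image for you, so the pool and image named in the job file have to exist beforehand. Roughly (the pg counts mirror the pool defaults from the ceph.conf above, and the job-file name is just an example):

# create the pool and a 5GB image matching the template (pool=test, rbdname=fio, size=5G)
ceph osd pool create test 4096 4096
rbd create fio --pool test --size 5120

# run the job file (fio must be built with rbd engine support)
fio rbd-4k-write.fio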
> So it looks to me that there is some contention within Ceph that might be causing this.
>
> I also ran perf and looked at the output; everything looks decent, but someone might want to have a look at it :).
>
> We have been able to reproduce this on 3 distinct platforms with some deviations (because of the hardware), but the behaviour is the same.
> Any thoughts would be highly appreciated; only getting 3,2K out of a 29K IOPS SSD is a bit frustrating :).
>
> Cheers.
> ----
> Sébastien Han
> Cloud Architect
>
> "Always give 100%. Unless you're giving blood."
>
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien....@enovance.com
> Address: 11 bis, rue Roquépine - 75008 Paris
> Web: www.enovance.com - Twitter: @enovance

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com