Has anyone ever tested multi-volume performance on a *FULL* SSD setup? We are able to get ~18K IOPS for 4K random reads on a single volume with fio (rbd engine) on a 12x DC S3700 setup, but only ~23K IOPS (peak) even with multiple volumes. It seems the maximum random write performance we can get out of the entire cluster is quite close to the single-volume performance.

Thanks,
Jian
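For anyone wanting to rule out client-side serialisation when testing this, one option is to drive several volumes in parallel from a single fio process. A minimal sketch of such a job file (the pool and image names are placeholders; each image must be created beforehand):

[global]
ioengine=rbd
clientname=admin
# placeholder pool name; create the pool and the images first
pool=test
rw=randwrite
bs=4k
direct=1
iodepth=32
group_reporting=1

[vol1]
rbdname=vol1

[vol2]
rbdname=vol2

[vol3]
rbdname=vol3

If the aggregate across vol1..vol3 still lands near the single-volume number, the limit is more likely server-side (in the OSDs) than in the client.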
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han
Sent: Tuesday, September 16, 2014 9:33 PM
To: Alexandre DERUMIER
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS

Hi,

Thanks for keeping us updated on this subject.
dsync is definitely killing the SSD.

I don't have much to add; I'm just surprised that you're only getting 5299 with 0.85, since I've been able to get 6,4K. Well, I was using the 200GB model, which might explain this.

On 12 Sep 2014, at 16:32, Alexandre DERUMIER <aderum...@odiso.com> wrote:

> Here are the results for the Intel s3500.
> ------------------------------------
> Max performance is with ceph 0.85 + optracker disabled.
> The Intel s3500 doesn't have the d_sync problem the Crucial has.
>
> %util shows almost 100% for read and write, so maybe the SSD's performance is the limit.
>
> I have some STEC ZeusRAM 8GB in stock (I used them for ZFS ZIL), I'll try to bench them next week.
>
> INTEL s3500
> -----------
> raw disk
> --------
>
> randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> bw=288207KB/s, iops=72051
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 0,00 73454,00 0,00 293816,00 0,00 8,00 30,96 0,42 0,42 0,00 0,01 99,90
>
> randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio --sync=1
> bw=48131KB/s, iops=12032
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 0,00 0,00 24120,00 0,00 48240,00 4,00 2,08 0,09 0,00 0,09 0,04 100,00
>
> ceph 0.80
> ---------
> randread (no tuning): bw=24578KB/s, iops=6144
>
> randwrite: bw=10358KB/s, iops=2589
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 373,00 0,00 8878,00 0,00 34012,50 7,66 1,63 0,18 0,00 0,18 0,06 50,90
>
> ceph 0.85
> ---------
> randread: bw=41406KB/s, iops=10351
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 2,00 0,00 10425,00 0,00 41816,00 0,00 8,02 1,36 0,13 0,13 0,00 0,07 75,90
>
> randwrite: bw=17204KB/s, iops=4301
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 333,00 0,00 9788,00 0,00 57909,00 11,83 1,46 0,15 0,00 0,15 0,07 67,80
>
> ceph 0.85 + tuning op_tracker=false
> -----------------------------------
> randread: bw=86537KB/s, iops=21634
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 25,00 0,00 21428,00 0,00 86444,00 0,00 8,07 3,13 0,15 0,15 0,00 0,05 98,00
>
> randwrite: bw=21199KB/s, iops=5299
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 1563,00 0,00 9880,00 0,00 75223,50 15,23 2,09 0,21 0,00 0,21 0,07 80,00
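For reference, a single-threaded fio run like the following isolates the O_DSYNC penalty discussed above on a raw device. This is a sketch only: it is destructive, and /dev/sdb is an example device name.

# DESTRUCTIVE: overwrites /dev/sdb (example device)
fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
    --iodepth=1 --numjobs=1 --runtime=30 --time_based --name=dsync-probe

Drives with power-loss protection (like the s3700/s3500) tend to hold up here, while consumer drives collapse, which matches the numbers in this thread.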
> ----- Original Message -----
> From: "Alexandre DERUMIER" <aderum...@odiso.com>
> To: "Cedric Lemarchand" <ced...@yipikai.org>
> Cc: ceph-users@lists.ceph.com
> Sent: Friday, 12 September 2014 08:15:08
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
> Results of fio on rbd with the kernel patch:
>
> fio rbd, Crucial m550, 1 osd, 0.85 (osd_enable_op_tracker true or false, same result):
> ---------------------------------------------------------------------------------
> bw=12327KB/s, iops=3081
>
> So, not much better than before, but this time iostat shows only 15% util, and latencies are lower:
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
>
> So the write bottleneck seems to be in Ceph.
>
> I will send the s3500 results today.
>
> ----- Original Message -----
> From: "Alexandre DERUMIER" <aderum...@odiso.com>
> To: "Cedric Lemarchand" <ced...@yipikai.org>
> Cc: ceph-users@lists.ceph.com
> Sent: Friday, 12 September 2014 07:58:05
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
>>> For the Crucial, I'll try to apply the patch from Stefan Priebe to ignore flushes (as the Crucial m550 has supercaps):
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
>
> Here are the results with cache flushes disabled:
>
> crucial m550
> ------------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
> bw=177575KB/s, iops=44393
>
> ----- Original Message -----
> From: "Alexandre DERUMIER" <aderum...@odiso.com>
> To: "Cedric Lemarchand" <ced...@yipikai.org>
> Cc: ceph-users@lists.ceph.com
> Sent: Friday, 12 September 2014 04:55:21
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
> Hi,
> it seems that the Intel s3500 performs a lot better with O_DSYNC:
>
> crucial m550
> ------------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
> bw=1249.9KB/s, iops=312
>
> intel s3500
> -----------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
> bw=41794KB/s, iops=10448
>
> OK, so 30x faster.
>
> For the Crucial, I have tried to apply the patch from Stefan Priebe to ignore flushes (as the Crucial m550 has supercaps):
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
> Coming from ZFS, this sounds like "zfs_nocacheflush".
>
> Now the results:
>
> crucial m550
> ------------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 --group_reporting --invalidate=0 --name=ab --sync=1
> bw=177575KB/s, iops=44393
>
> fio rbd, Crucial m550, 1 osd, 0.85 (osd_enable_op_tracker true or false, same result):
> ---------------------------------------------------------------------------------
> bw=12327KB/s, iops=3081
>
> So, not much better than before, but this time iostat shows only 15% util, and latencies are lower:
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
>
> So the write bottleneck seems to be in Ceph.
>
> I will send the s3500 results today.
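As an aside, on kernels that expose the SCSI disk cache mode in sysfs, a similar no-flush effect can be had without patching Ceph, by declaring the cache write-through so the block layer stops issuing cache flushes. This is only safe on drives with power-loss protection (supercaps), and the SCSI address below is an example:

# example SCSI address; list yours with: ls /sys/class/scsi_disk/
echo "write through" > /sys/class/scsi_disk/0:0:0:0/cache_type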
> ----- Original Message -----
> From: "Cedric Lemarchand" <ced...@yipikai.org>
> To: ceph-users@lists.ceph.com
> Sent: Thursday, 11 September 2014 21:23:23
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
> On 11/09/2014 19:33, Cedric Lemarchand wrote:
>> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>>> Hi Sebastien,
>>>
>>> here are my first results with the Crucial m550 (I'll send the Intel s3500 results later):
>>>
>>> - 3 nodes
>>> - Dell R620 without expander backplane
>>> - SAS controller: LSI 9207 (no hardware RAID or cache)
>>> - 2x E5-2603v2 1.8GHz (4 cores)
>>> - 32GB RAM
>>> - network: 2x gigabit LACP + 2x gigabit LACP for cluster replication
>>> - OS: Debian wheezy, with kernel 3.10
>>>
>>> os + ceph mon: 2x Intel s3500 100GB, Linux soft RAID
>>> osd: Crucial m550 (1TB)
>>>
>>> 3 mons in the ceph cluster, and 1 osd (journal and data on the same disk).
>>>
>>> ceph.conf
>>> ---------
>>> debug_lockdep = 0/0
>>> debug_context = 0/0
>>> debug_crush = 0/0
>>> debug_buffer = 0/0
>>> debug_timer = 0/0
>>> debug_filer = 0/0
>>> debug_objecter = 0/0
>>> debug_rados = 0/0
>>> debug_rbd = 0/0
>>> debug_journaler = 0/0
>>> debug_objectcacher = 0/0
>>> debug_client = 0/0
>>> debug_osd = 0/0
>>> debug_optracker = 0/0
>>> debug_objclass = 0/0
>>> debug_filestore = 0/0
>>> debug_journal = 0/0
>>> debug_ms = 0/0
>>> debug_monc = 0/0
>>> debug_tp = 0/0
>>> debug_auth = 0/0
>>> debug_finisher = 0/0
>>> debug_heartbeatmap = 0/0
>>> debug_perfcounter = 0/0
>>> debug_asok = 0/0
>>> debug_throttle = 0/0
>>> debug_mon = 0/0
>>> debug_paxos = 0/0
>>> debug_rgw = 0/0
>>> osd_op_threads = 5
>>> filestore_op_threads = 4
>>>
>>> ms_nocrc = true
>>> cephx sign messages = false
>>> cephx require signatures = false
>>>
>>> ms_dispatch_throttle_bytes = 0
>>>
>>> # 0.85
>>> throttler_perf_counter = false
>>> filestore_fd_cache_size = 64
>>> filestore_fd_cache_shards = 32
>>> osd_op_num_threads_per_shard = 1
>>> osd_op_num_shards = 25
>>> osd_enable_op_tracker = true
>>>
>>> Fio disk 4K benchmark
>>> ---------------------
>>> rand read 4k: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>>> bw=271755KB/s, iops=67938
>>>
>>> rand write 4k: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>>> bw=228293KB/s, iops=57073
>>>
>>> fio osd benchmark (through librbd)
>>> ----------------------------------
>>> [global]
>>> ioengine=rbd
>>> clientname=admin
>>> pool=test
>>> rbdname=test
>>> invalidate=0 # mandatory
>>> rw=randwrite
>>> rw=randread
>>> bs=4k
>>> direct=1
>>> numjobs=4
>>> group_reporting=1
>>>
>>> [rbd_iodepth32]
>>> iodepth=32
>>>
>>> FIREFLY RESULTS
>>> ---------------
>>> fio randwrite: bw=5009.6KB/s, iops=1252
>>> fio randread: bw=37820KB/s, iops=9455
>>>
>>> 0.85 RESULTS
>>> ------------
>>> fio randwrite: bw=11658KB/s, iops=2914
>>> fio randread: bw=38642KB/s, iops=9660
>>>
>>> 0.85 + osd_enable_op_tracker=false
>>> -----------------------------------
>>> fio randwrite: bw=11630KB/s, iops=2907
>>> fio randread: bw=80606KB/s, iops=20151 (cpu 100% - GREAT!)
>>>
>>> So, for reads, it seems that osd_enable_op_tracker is the bottleneck.
>>>
>>> Now for writes, I really don't understand why it's so low.
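For anyone reproducing the librbd job above, the pool and image it references can be created along these lines (the PG count, image size, and the job-file name are examples; the job uses pool=test and rbdname=test):

ceph osd pool create test 128 128
rbd create test --pool test --size 10240   # 10 GB image, size in MB
fio rbd-bench.fio                          # whatever name you saved the job file under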
>>>
>>> I have done some iostat:
>>>
>>> FIO directly on /dev/sdb
>>> bw=228293KB/s, iops=57073
>>>
>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>> sdb 0,00 0,00 0,00 63613,00 0,00 254452,00 8,00 31,24 0,49 0,00 0,49 0,02 100,00
>>>
>>> FIO on the osd through librbd
>>> bw=11658KB/s, iops=2914
>>>
>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>> sdb 0,00 355,00 0,00 5225,00 0,00 29678,00 11,36 57,63 11,03 0,00 11,03 0,19 99,70
>>>
>>> (I don't understand what exactly %util is: 100% in both cases, even though ceph is 10x slower.)
>>
>> It would be interesting if you could catch the size of the writes on the SSD during the bench through librbd (I know nmon can do that).
>
> Replying to myself... I asked a bit too quickly, since we already have this information (29678 / 5225 = 5.68KB average write size), but this is irrelevant.
>
> Cheers
>
>>> It could be a dsync problem; the result seems pretty poor:
>>>
>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
>>> 65536+0 records in
>>> 65536+0 records out
>>> 268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
>>>
>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
>>> ^C17228+0 records in
>>> 17228+0 records out
>>> 70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
>>>
>>> I'll do tests with the Intel s3500 tomorrow to compare.
>>>
>>> ----- Original Message -----
>>> From: "Sebastien Han" <sebastien....@enovance.com>
>>> To: "Warren Wang" <warren_w...@cable.comcast.com>
>>> Cc: ceph-users@lists.ceph.com
>>> Sent: Monday, 8 September 2014 22:58:25
>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>
>>> They definitely are, Warren!
>>>
>>> Thanks for bringing this here :).
>>>
>>> On 05 Sep 2014, at 23:02, Wang, Warren <warren_w...@cable.comcast.com> wrote:
>>>
>>>> +1 to what Cedric said.
>>>>
>>>> Anything more than a few minutes of heavy sustained writes tended to get our solid state devices into a state where garbage collection could not keep up. Originally we used small SSDs and did not overprovision the journals by much. Manufacturers publish their SSD stats, and then, in very small font, state that the attained IOPS are with empty drives and that the tests are only run for very short amounts of time. Even if the drives are new, it's a good idea to perform an hdparm secure erase on them (so that the SSD knows the blocks are truly unused), and then overprovision them. You'll know if you have a problem by watching the utilization and wait data on the journals.
>>>>
>>>> One of the other interesting performance issues is that the Intel 10GbE NICs + default kernel that we typically use max out around 1 million packets/sec. It's worth tracking this metric to see if you are close.
>>>>
>>>> I know these aren't necessarily relevant to the test parameters you gave below, but they're worth keeping in mind.
>>>>
>>>> --
>>>> Warren Wang
>>>> Comcast Cloud (OpenStack)
>>>>
>>>> From: Cedric Lemarchand <ced...@yipikai.org>
>>>> Date: Wednesday, September 3, 2014 at 5:14 PM
>>>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>
>>>> On 03/09/2014 22:11, Sebastien Han wrote:
>>>>> Hi Warren,
>>>>>
>>>>> What do you mean exactly by secure erase? At the firmware level, with the manufacturer's tools?
>>>>> SSDs were pretty new, so I don't think we hit that sort of thing. I believe that only aged SSDs show this behaviour, but I might be wrong.
>>>>>
>>>> Sorry, I forgot to reply to the real question ;-) So yes, it only comes into play after some time; in your case, if the SSD still delivers the write IOPS specified by the manufacturer, it won't help in any way.
>>>>
>>>> But it seems this practice is increasingly common nowadays.
>>>>
>>>> Cheers
>>>>
>>>>> On 02 Sep 2014, at 18:23, Wang, Warren <warren_w...@cable.comcast.com> wrote:
>>>>>
>>>>>> Hi Sebastien,
>>>>>>
>>>>>> Something I didn't see in the thread so far: did you secure erase the SSDs before they got used? I assume these were probably repurposed for this test. We have seen some pretty significant garbage collection issues on various SSDs and other forms of solid state storage, to the point where we are now overprovisioning pretty much every solid state device, by as much as 50%, to handle sustained write operations. Especially important for the journals, as we've found.
>>>>>>
>>>>>> Maybe not an issue on the short fio run below, but certainly evident on longer runs or with lots of historical data on the drives. The max transaction time looks pretty good for your test. Something to consider though.
>>>>>>
>>>>>> Warren
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han
>>>>>> Sent: Thursday, August 28, 2014 12:12 PM
>>>>>> To: ceph-users
>>>>>> Cc: Mark Nelson
>>>>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> It has been a while since the last performance-related thread on the ML :p
>>>>>> I've been running some experiments to see how much I can get from an SSD on a Ceph cluster.
>>>>>> To achieve that I did something pretty simple:
>>>>>>
>>>>>> * Debian wheezy 7.6
>>>>>> * kernel from debian 3.14-0.bpo.2-amd64
>>>>>> * 1 cluster, 3 mons (I'd like to keep this realistic, since in a real deployment I'll use 3)
>>>>>> * 1 OSD backed by an SSD (journal and osd data on the same device)
>>>>>> * replica count of 1
>>>>>> * partitions are perfectly aligned
>>>>>> * io scheduler is set to noop, but deadline was showing the same results
>>>>>> * no updatedb running
>>>>>>
>>>>>> About the box:
>>>>>>
>>>>>> * 32GB of RAM
>>>>>> * 12 cores with HT @ 2.4 GHz
>>>>>> * WB cache is enabled on the controller
>>>>>> * 10Gbps network (doesn't help here)
>>>>>>
>>>>>> The SSD is a 200GB Intel DC S3700, capable of delivering around 29K IOPS with random 4k writes (my fio results). As a benchmark tool I used fio with the rbd engine (thanks Deutsche Telekom guys!).
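Regarding the secure erase and overprovisioning advice above, a rough sketch of the usual procedure. It is destructive, /dev/sdX is a placeholder, the sector count is drive-specific, and the drive must not be security-frozen (a suspend/resume cycle or hot-replug may be needed to unfreeze it):

# check the security state first; "not frozen" is required
hdparm -I /dev/sdX | grep -i frozen
# set a throwaway password, then issue the erase
hdparm --user-master u --security-set-pass p /dev/sdX
hdparm --user-master u --security-erase p /dev/sdX
# overprovision afterwards, e.g. by leaving 20-50% of the drive
# unpartitioned, or by capping the visible capacity with an HPA
# (example sector count only):
hdparm -N p312500000 /dev/sdX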
>>>>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>>>>>
>>>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>>>>>> 65536+0 records in
>>>>>> 65536+0 records out
>>>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>>>>>
>>>>>> # du -sh rand.file
>>>>>> 256M rand.file
>>>>>>
>>>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>>>>>> 65536+0 records in
>>>>>> 65536+0 records out
>>>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>>>>>>
>>>>>> See my ceph.conf:
>>>>>>
>>>>>> [global]
>>>>>> auth cluster required = cephx
>>>>>> auth service required = cephx
>>>>>> auth client required = cephx
>>>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>>>>> osd pool default pg num = 4096
>>>>>> osd pool default pgp num = 4096
>>>>>> osd pool default size = 2
>>>>>> osd crush chooseleaf type = 0
>>>>>>
>>>>>> debug lockdep = 0/0
>>>>>> debug context = 0/0
>>>>>> debug crush = 0/0
>>>>>> debug buffer = 0/0
>>>>>> debug timer = 0/0
>>>>>> debug journaler = 0/0
>>>>>> debug osd = 0/0
>>>>>> debug optracker = 0/0
>>>>>> debug objclass = 0/0
>>>>>> debug filestore = 0/0
>>>>>> debug journal = 0/0
>>>>>> debug ms = 0/0
>>>>>> debug monc = 0/0
>>>>>> debug tp = 0/0
>>>>>> debug auth = 0/0
>>>>>> debug finisher = 0/0
>>>>>> debug heartbeatmap = 0/0
>>>>>> debug perfcounter = 0/0
>>>>>> debug asok = 0/0
>>>>>> debug throttle = 0/0
>>>>>>
>>>>>> [mon]
>>>>>> mon osd down out interval = 600
>>>>>> mon osd min down reporters = 13
>>>>>> [mon.ceph-01]
>>>>>> host = ceph-01
>>>>>> mon addr = 172.20.20.171
>>>>>> [mon.ceph-02]
>>>>>> host = ceph-02
>>>>>> mon addr = 172.20.20.172
>>>>>> [mon.ceph-03]
>>>>>> host = ceph-03
>>>>>> mon addr = 172.20.20.173
>>>>>>
>>>>>> debug lockdep = 0/0
>>>>>> debug context = 0/0
>>>>>> debug crush = 0/0
>>>>>> debug buffer = 0/0
>>>>>> debug timer = 0/0
>>>>>> debug journaler = 0/0
>>>>>> debug osd = 0/0
>>>>>> debug optracker = 0/0
>>>>>> debug objclass = 0/0
>>>>>> debug filestore = 0/0
>>>>>> debug journal = 0/0
>>>>>> debug ms = 0/0
>>>>>> debug monc = 0/0
>>>>>> debug tp = 0/0
>>>>>> debug auth = 0/0
>>>>>> debug finisher = 0/0
>>>>>> debug heartbeatmap = 0/0
>>>>>> debug perfcounter = 0/0
>>>>>> debug asok = 0/0
>>>>>> debug throttle = 0/0
>>>>>>
>>>>>> [osd]
>>>>>> osd mkfs type = xfs
>>>>>> osd mkfs options xfs = -f -i size=2048
>>>>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>>>>> osd journal size = 20480
>>>>>> cluster_network = 172.20.20.0/24
>>>>>> public_network = 172.20.20.0/24
>>>>>> osd mon heartbeat interval = 30
>>>>>> # Performance tuning
>>>>>> filestore merge threshold = 40
>>>>>> filestore split multiple = 8
>>>>>> osd op threads = 8
>>>>>> # Recovery tuning
>>>>>> osd recovery max active = 1
>>>>>> osd max backfills = 1
>>>>>> osd recovery op priority = 1
>>>>>>
>>>>>> debug lockdep = 0/0
>>>>>> debug context = 0/0
>>>>>> debug crush = 0/0
>>>>>> debug buffer = 0/0
>>>>>> debug timer = 0/0
>>>>>> debug journaler = 0/0
>>>>>> debug osd = 0/0
>>>>>> debug optracker = 0/0
>>>>>> debug objclass = 0/0
>>>>>> debug filestore = 0/0
>>>>>> debug journal = 0/0
>>>>>> debug ms = 0/0
>>>>>> debug monc = 0/0
>>>>>> debug tp = 0/0
>>>>>> debug auth = 0/0
>>>>>> debug finisher = 0/0
>>>>>> debug heartbeatmap = 0/0
>>>>>> debug perfcounter = 0/0
>>>>>> debug asok = 0/0
>>>>>> debug throttle = 0/0
>>>>>>
>>>>>> Disabling all debugging gained me 200/300 more IOPS.
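When experimenting with the debug settings above, they can also be toggled at runtime without restarting the daemons, from a node with an admin keyring (subsystem list trimmed here for brevity):

ceph tell osd.* injectargs '--debug-osd 0/0 --debug-ms 0/0 --debug-filestore 0/0 --debug-journal 0/0'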
>>>>>> See my fio template:
>>>>>>
>>>>>> [global]
>>>>>> #logging
>>>>>> #write_iops_log=write_iops_log
>>>>>> #write_bw_log=write_bw_log
>>>>>> #write_lat_log=write_lat_lo
>>>>>>
>>>>>> time_based
>>>>>> runtime=60
>>>>>>
>>>>>> ioengine=rbd
>>>>>> clientname=admin
>>>>>> pool=test
>>>>>> rbdname=fio
>>>>>> invalidate=0 # mandatory
>>>>>> #rw=randwrite
>>>>>> rw=write
>>>>>> bs=4k
>>>>>> #bs=32m
>>>>>> size=5G
>>>>>> group_reporting
>>>>>>
>>>>>> [rbd_iodepth32]
>>>>>> iodepth=32
>>>>>> direct=1
>>>>>>
>>>>>> See my fio output:
>>>>>>
>>>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>> fio-2.1.11-14-gb74e
>>>>>> Starting 1 process
>>>>>> rbd engine: RBD version: 0.1.8
>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>>>>>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>>>>>   slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>>>>>   clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>>>>>   lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>>>>>   clat percentiles (usec):
>>>>>>   |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>>>>>   | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>>>>>   | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>>>>>   | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>>>>>   | 99.99th=[28032]
>>>>>>   bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>>>>>   lat (msec): 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>>>>>   cpu: usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>>>>>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>>>>>   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>   complete: 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>>>   issued: total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>>>>>   latency: target=0, window=0, percentile=100.00%, depth=32
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s, mint=60010msec, maxt=60010msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>>   dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>>>>>>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>>>>>>
>>>>>> I tried to tweak several parameters, like:
>>>>>>
>>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>>>>>> filestore queue max ops = 2000
>>>>>>
>>>>>> But that didn't bring any improvement.
>>>>>>
>>>>>> Then I tried other things:
>>>>>>
>>>>>> * Increasing the io_depth up to 256 or 512 gave me between 50 and 100 more IOPS, but it's not a realistic workload anymore and not that significant.
>>>>>> * Adding another SSD for the journal: still getting 3,2K IOPS.
>>>>>> * I tried with rbd bench and also got 3K IOPS.
>>>>>> * I ran the test on a client machine and then locally on the server: still getting 3,2K IOPS.
>>>>>> * Putting the journal in memory: still getting 3,2K IOPS.
>>>>>> * With 2 clients running the test in parallel I got a total of 3,6K IOPS, but I don't seem to be able to go over that.
>>>>>> * I tried adding another OSD to that SSD, so I had 2 OSDs and 2 journals on 1 SSD: got 4,5K IOPS. YAY!
>>>>>>
>>>>>> Given these results, it seems that something is limiting the number of IOPS per OSD process.
>>>>>>
>>>>>> Running the test on a client or locally didn't show any difference, so it looks to me like there is some contention within Ceph that might cause this.
>>>>>>
>>>>>> I also ran perf and looked at the output; everything looks decent, but someone might want to have a look at it :).
>>>>>>
>>>>>> We have been able to reproduce this on 3 distinct platforms with some deviations (because of the hardware), but the behaviour is the same.
>>>>>> Any thoughts will be highly appreciated; only getting 3,2K out of a 29K IOPS SSD is a bit frustrating :).
>>>>>>
>>>>>> Cheers.
>>>>>> ----
>>>>>> Sébastien Han
>>>>>> Cloud Architect
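To reproduce the two-OSDs-on-one-SSD layout from the last bullet in Sébastien's list above, a rough 2014-era sketch with ceph-disk. The device name and partition layout are assumptions, and the parted/sgdisk partitioning steps are omitted:

# assumes /dev/sdb has been pre-partitioned into sdb1..sdb4
ceph-disk prepare /dev/sdb1 /dev/sdb3   # first OSD: data, journal
ceph-disk prepare /dev/sdb2 /dev/sdb4   # second OSD: data, journal
ceph-disk activate /dev/sdb1
ceph-disk activate /dev/sdb2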
Cheers.
----
Sébastien Han
Cloud Architect

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien....@enovance.com
Address: 11 bis, rue Roquépine - 75008 Paris
Web: www.enovance.com - Twitter: @enovance

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com