On Thu, Sep 18, 2014 at 03:36:48PM +0200, Alexandre DERUMIER wrote:
> >>Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
> 
> I know that Stefan Priebe runs full ssd clusters in production, and has done
> benchmarks.
> (As far as I remember, he benched around 20k peak with dumpling)
> 
> >>We are able to get ~18K IOPS for 4K random read on a single volume with fio
> >>(with rbd engine) on a 12x DC3700 setup, but only able to get ~23K (peak)
> >>IOPS even with multiple volumes.
> >>Seems the maximum random write performance we can get on the entire cluster
> >>is quite close to single volume performance.
> 
> Firefly or Giant?
Seems the maximum possible 4k seq-write IOPS you can get is around 20K, regardless of whether you have 2 or 400 OSDs, SAS or SSD, or 3 or 9 storage nodes. The CPU is the limiting resource, because of the overhead in the code.

My IO subsystem would be able to handle 2 million IOPS of 4K writes with repli=2:
. 9 storage nodes
. in total 18x Intel P3700 PCIe SSDs over NVMe (each 150k random-write IOPS at 4K)
. in total 357x SAS 2.5" drives via 18x LSI MegaRAID 2208
. 10 GbE to the 9 client nodes
. 56 Gb InfiniBand as cluster interconnect

There was an improvement between 0.80.x and 0.81, but then the performance dropped again ... (see attachment)

-Dieter

> I'll do benchs with 6 osd dc3500 tomorrow to compare firefly and giant.
> 
> ----- Mail original -----
> 
> De: "Jian Zhang" <jian.zh...@intel.com>
> À: "Sebastien Han" <sebastien....@enovance.com>, "Alexandre DERUMIER" <aderum...@odiso.com>
> Cc: ceph-users@lists.ceph.com
> Envoyé: Jeudi 18 Septembre 2014 08:12:32
> Objet: RE: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
> 
> Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
> We are able to get ~18K IOPS for 4K random read on a single volume with fio
> (with rbd engine) on a 12x DC3700 setup, but only able to get ~23K (peak)
> IOPS even with multiple volumes.
> Seems the maximum random write performance we can get on the entire cluster
> is quite close to single volume performance.
> 
> Thanks
> Jian
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han
> Sent: Tuesday, September 16, 2014 9:33 PM
> To: Alexandre DERUMIER
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
> 
> Hi,
> 
> Thanks for keeping us updated on this subject.
> dsync is definitely killing the ssd.
> 
> I don't have much to add, I'm just surprised that you're only getting 5299
> with 0.85 since I've been able to get 6,4K, well I was using the 200GB model,
> that might explain this.
> 
> On 12 Sep 2014, at 16:32, Alexandre DERUMIER <aderum...@odiso.com> wrote:
> 
> > here the results for the intel s3500
> > ------------------------------------
> > max performance is with ceph 0.85 + optracker disabled.
> > the intel s3500 doesn't have the d_sync problem like the crucial
> >
> > %util shows almost 100% for read and write, so maybe the ssd disk
> > performance is the limit.
> >
> > I have some stec zeusram 8GB in stock (I used them for zfs zil), I'll try
> > to bench them next week.
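As a sketch of the multi-volume test discussed above: a multi-volume variant of the fio rbd job file quoted further down in this thread could look roughly like the following. The pool name, image names and number of [volN] sections are placeholders; the RBD images must already exist and be sized for the test.

[global]
ioengine=rbd
clientname=admin
pool=test            # placeholder, use an existing pool
invalidate=0         # mandatory for the rbd engine
rw=randwrite
bs=4k
direct=1
iodepth=32
group_reporting=1

# one job section per RBD image, so several volumes are driven in parallel
[vol1]
rbdname=test1        # placeholder image names
[vol2]
rbdname=test2
[vol3]
rbdname=test3
[vol4]
rbdname=test4

Comparing the aggregate IOPS of such a run against a single [vol1] job is a direct way to check whether the whole cluster tops out near single-volume performance.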
> >
> > INTEL s3500
> > -----------
> > raw disk
> > --------
> >
> > randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
> > bw=288207KB/s, iops=72051
> >
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 0,00 0,00 73454,00 0,00 293816,00 0,00 8,00 30,96 0,42 0,42 0,00 0,01 99,90
> >
> > randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio --sync=1
> > bw=48131KB/s, iops=12032
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 0,00 0,00 0,00 24120,00 0,00 48240,00 4,00 2,08 0,09 0,00 0,09 0,04 100,00
> >
> >
> > ceph 0.80
> > ---------
> > randread: no tuning: bw=24578KB/s, iops=6144
> >
> > randwrite: bw=10358KB/s, iops=2589
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 0,00 373,00 0,00 8878,00 0,00 34012,50 7,66 1,63 0,18 0,00 0,18 0,06 50,90
> >
> >
> > ceph 0.85 :
> > ---------
> >
> > randread : bw=41406KB/s, iops=10351
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 2,00 0,00 10425,00 0,00 41816,00 0,00 8,02 1,36 0,13 0,13 0,00 0,07 75,90
> >
> > randwrite : bw=17204KB/s, iops=4301
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 0,00 333,00 0,00 9788,00 0,00 57909,00 11,83 1,46 0,15 0,00 0,15 0,07 67,80
> >
> >
> > ceph 0.85 tuning op_tracker=false
> > ---------------------------------
> >
> > randread : bw=86537KB/s, iops=21634
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 25,00 0,00 21428,00 0,00 86444,00 0,00 8,07 3,13 0,15 0,15 0,00 0,05 98,00
> >
> > randwrite: bw=21199KB/s, iops=5299
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 0,00 1563,00 0,00 9880,00 0,00 75223,50 15,23 2,09 0,21 0,00 0,21 0,07 80,00
> >
> >
> > ----- Mail original -----
> >
> > De: "Alexandre DERUMIER" <aderum...@odiso.com>
> > À: "Cedric Lemarchand" <ced...@yipikai.org>
> > Cc: ceph-users@lists.ceph.com
> > Envoyé: Vendredi 12 Septembre 2014 08:15:08
> > Objet: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
> >
> > results of fio on rbd with kernel patch
> >
> > fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same result):
> > ---------------------------
> > bw=12327KB/s, iops=3081
> >
> > So no much better than before, but this time, iostat show only 15%
> > utils, and latencies are lower
> >
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> > sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
> >
> > So, the write bottleneck seem to be in ceph.
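The per-device lines above (rrqm/s ... %util) are extended iostat statistics; something along these lines, run next to fio, produces them (iostat comes from the sysstat package):

# extended device statistics, refreshed every second;
# same columns (r/s, w/s, rkB/s, wkB/s, await, svctm, %util) as quoted above
iostat -x -d 1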
> > > > > > > > I will send s3500 result today > > > > ----- Mail original ----- > > > > De: "Alexandre DERUMIER" <aderum...@odiso.com> > > À: "Cedric Lemarchand" <ced...@yipikai.org> > > Cc: ceph-users@lists.ceph.com > > Envoyé: Vendredi 12 Septembre 2014 07:58:05 > > Objet: Re: [ceph-users] [Single OSD performance on SSD] Can't go over > > 3, 2K IOPS > > > >>> For crucial, I'll try to apply the patch from stefan priebe, to > >>> ignore flushes (as crucial m550 have supercaps) > >>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/03 > >>> 5707.html > > Here the results, disable cache flush > > > > crucial m550 > > ------------ > > #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 > > --group_reporting --invalidate=0 --name=ab --sync=1 bw=177575KB/s, > > iops=44393 > > > > > > ----- Mail original ----- > > > > De: "Alexandre DERUMIER" <aderum...@odiso.com> > > À: "Cedric Lemarchand" <ced...@yipikai.org> > > Cc: ceph-users@lists.ceph.com > > Envoyé: Vendredi 12 Septembre 2014 04:55:21 > > Objet: Re: [ceph-users] [Single OSD performance on SSD] Can't go over > > 3, 2K IOPS > > > > Hi, > > seem that intel s3500 perform a lot better with o_dsync > > > > crucial m550 > > ------------ > > #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 > > --group_reporting --invalidate=0 --name=ab --sync=1 bw=1249.9KB/s, > > iops=312 > > > > intel s3500 > > ----------- > > fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 > > --group_reporting --invalidate=0 --name=ab --sync=1 #bw=41794KB/s, > > iops=10448 > > > > ok, so 30x faster. > > > > > > > > For crucial, I have try to apply the patch from stefan priebe, to > > ignore flushes (as crucial m550 have supercaps) > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/0357 > > 07.html Coming from zfs, this sound like "zfs_nocacheflush" > > > > Now results: > > > > crucial m550 > > ------------ > > #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 > > --group_reporting --invalidate=0 --name=ab --sync=1 bw=177575KB/s, > > iops=44393 > > > > > > > > fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same > > result): > > --------------------------- > > bw=12327KB/s, iops=3081 > > > > So no much better than before, but this time, iostat show only 15% > > utils, and latencies are lower > > > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > > r_await w_await svctm %util sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 > > 23,90 0,29 0,10 0,00 0,10 0,05 15,20 > > > > > > So, the write bottleneck seem to be in ceph. > > > > > > > > I will send s3500 result today > > > > ----- Mail original ----- > > > > De: "Cedric Lemarchand" <ced...@yipikai.org> > > À: ceph-users@lists.ceph.com > > Envoyé: Jeudi 11 Septembre 2014 21:23:23 > > Objet: Re: [ceph-users] [Single OSD performance on SSD] Can't go over > > 3, 2K IOPS > > > > > > Le 11/09/2014 19:33, Cedric Lemarchand a écrit : > >> Le 11/09/2014 08:20, Alexandre DERUMIER a écrit : > >>> Hi Sebastien, > >>> > >>> here my first results with crucial m550 (I'll send result with intel > >>> s3500 later): > >>> > >>> - 3 nodes > >>> - dell r620 without expander backplane > >>> - sas controller : lsi LSI 9207 (no hardware raid or cache) > >>> - 2 x E5-2603v2 1.8GHz (4cores) > >>> - 32GB ram > >>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication. 
> >>> > >>> -os : debian wheezy, with kernel 3.10 > >>> > >>> os + ceph mon : 2x intel s3500 100gb linux soft raid osd : crucial > >>> m550 (1TB). > >>> > >>> > >>> 3mon in the ceph cluster, > >>> and 1 osd (journal and datas on same disk) > >>> > >>> > >>> ceph.conf > >>> --------- > >>> debug_lockdep = 0/0 > >>> debug_context = 0/0 > >>> debug_crush = 0/0 > >>> debug_buffer = 0/0 > >>> debug_timer = 0/0 > >>> debug_filer = 0/0 > >>> debug_objecter = 0/0 > >>> debug_rados = 0/0 > >>> debug_rbd = 0/0 > >>> debug_journaler = 0/0 > >>> debug_objectcatcher = 0/0 > >>> debug_client = 0/0 > >>> debug_osd = 0/0 > >>> debug_optracker = 0/0 > >>> debug_objclass = 0/0 > >>> debug_filestore = 0/0 > >>> debug_journal = 0/0 > >>> debug_ms = 0/0 > >>> debug_monc = 0/0 > >>> debug_tp = 0/0 > >>> debug_auth = 0/0 > >>> debug_finisher = 0/0 > >>> debug_heartbeatmap = 0/0 > >>> debug_perfcounter = 0/0 > >>> debug_asok = 0/0 > >>> debug_throttle = 0/0 > >>> debug_mon = 0/0 > >>> debug_paxos = 0/0 > >>> debug_rgw = 0/0 > >>> osd_op_threads = 5 > >>> filestore_op_threads = 4 > >>> > >>> ms_nocrc = true > >>> cephx sign messages = false > >>> cephx require signatures = false > >>> > >>> ms_dispatch_throttle_bytes = 0 > >>> > >>> #0.85 > >>> throttler_perf_counter = false > >>> filestore_fd_cache_size = 64 > >>> filestore_fd_cache_shards = 32 > >>> osd_op_num_threads_per_shard = 1 > >>> osd_op_num_shards = 25 > >>> osd_enable_op_tracker = true > >>> > >>> > >>> > >>> Fio disk 4K benchmark > >>> ------------------ > >>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread > >>> --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc > >>> --ioengine=aio bw=271755KB/s, iops=67938 > >>> > >>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite > >>> --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc > >>> --ioengine=aio bw=228293KB/s, iops=57073 > >>> > >>> > >>> > >>> fio osd benchmark (through librbd) > >>> ---------------------------------- > >>> [global] > >>> ioengine=rbd > >>> clientname=admin > >>> pool=test > >>> rbdname=test > >>> invalidate=0 # mandatory > >>> rw=randwrite > >>> rw=randread > >>> bs=4k > >>> direct=1 > >>> numjobs=4 > >>> group_reporting=1 > >>> > >>> [rbd_iodepth32] > >>> iodepth=32 > >>> > >>> > >>> > >>> FIREFLY RESULTS > >>> ---------------- > >>> fio randwrite : bw=5009.6KB/s, iops=1252 > >>> > >>> fio randread: bw=37820KB/s, iops=9455 > >>> > >>> > >>> > >>> O.85 RESULTS > >>> ------------ > >>> > >>> fio randwrite : bw=11658KB/s, iops=2914 > >>> > >>> fio randread : bw=38642KB/s, iops=9660 > >>> > >>> > >>> > >>> 0.85 + osd_enable_op_tracker=false > >>> ----------------------------------- > >>> fio randwrite : bw=11630KB/s, iops=2907 fio randread : bw=80606KB/s, > >>> iops=20151, (cpu 100% - GREAT !) > >>> > >>> > >>> > >>> So, for read, seem that osd_enable_op_tracker is the bottleneck. > >>> > >>> > >>> Now for write, I really don't understand why it's so low. 
> >>> > >>> > >>> I have done some iostat: > >>> > >>> > >>> FIO directly on /dev/sdb > >>> bw=228293KB/s, iops=57073 > >>> > >>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >>> r_await w_await svctm %util sdb 0,00 0,00 0,00 63613,00 0,00 > >>> 254452,00 8,00 31,24 0,49 0,00 0,49 0,02 100,00 > >>> > >>> > >>> FIO directly on osd through librbd > >>> bw=11658KB/s, iops=2914 > >>> > >>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await > >>> r_await w_await svctm %util sdb 0,00 355,00 0,00 5225,00 0,00 > >>> 29678,00 11,36 57,63 11,03 0,00 11,03 0,19 99,70 > >>> > >>> > >>> (I don't understand what exactly is %util, 100% in the 2 cases, > >>> because 10x slower with ceph) > >> It would be interesting if you could catch the size of writes on SSD > >> during the bench through librbd (I know nmon can do that) > > Replying to myself ... I ask a bit quickly in the way we already have > > this information (29678 / 5225 = 5,68Ko), but this is irrelevant. > > > > Cheers > > > >>> It could be a dsync problem, result seem pretty poor > >>> > >>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct > >>> 65536+0 enregistrements lus > >>> 65536+0 enregistrements écrits > >>> 268435456 octets (268 MB) copiés, 2,77433 s, 96,8 MB/s > >>> > >>> > >>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct > >>> ^C17228+0 enregistrements lus > >>> 17228+0 enregistrements écrits > >>> 70565888 octets (71 MB) copiés, 70,4098 s, 1,0 MB/s > >>> > >>> > >>> > >>> I'll do tests with intel s3500 tomorrow to compare > >>> > >>> ----- Mail original ----- > >>> > >>> De: "Sebastien Han" <sebastien....@enovance.com> > >>> À: "Warren Wang" <warren_w...@cable.comcast.com> > >>> Cc: ceph-users@lists.ceph.com > >>> Envoyé: Lundi 8 Septembre 2014 22:58:25 > >>> Objet: Re: [ceph-users] [Single OSD performance on SSD] Can't go > >>> over 3, 2K IOPS > >>> > >>> They definitely are Warren! > >>> > >>> Thanks for bringing this here :). > >>> > >>> On 05 Sep 2014, at 23:02, Wang, Warren <warren_w...@cable.comcast.com> > >>> wrote: > >>> > >>>> +1 to what Cedric said. > >>>> > >>>> Anything more than a few minutes of heavy sustained writes tended to get > >>>> our solid state devices into a state where garbage collection could not > >>>> keep up. Originally we used small SSDs and did not overprovision the > >>>> journals by much. Manufacturers publish their SSD stats, and then in > >>>> very small font, state that the attained IOPS are with empty drives, and > >>>> the tests are only run for very short amounts of time. Even if the > >>>> drives are new, it's a good idea to perform an hdparm secure erase on > >>>> them (so that the SSD knows that the blocks are truly unused), and then > >>>> overprovision them. You'll know if you have a problem by watching for > >>>> utilization and wait data on the journals. > >>>> > >>>> One of the other interesting performance issues is that the Intel 10Gbe > >>>> NICs + default kernel that we typically use max out around 1million > >>>> packets/sec. It's worth tracking this metric to if you are close. > >>>> > >>>> I know these aren't necessarily relevant to the test parameters you gave > >>>> below, but they're worth keeping in mind. 
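For the secure-erase step mentioned above, the usual ATA sequence with hdparm looks roughly like the sketch below. /dev/sdX and the password "p" are placeholders, the drive must not be in the "frozen" security state, and the erase destroys all data on it.

# set a temporary ATA security password, then issue the secure erase
hdparm --user-master u --security-set-pass p /dev/sdX
hdparm --user-master u --security-erase p /dev/sdX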
> >>>> > >>>> -- > >>>> Warren Wang > >>>> Comcast Cloud (OpenStack) > >>>> > >>>> > >>>> From: Cedric Lemarchand <ced...@yipikai.org> > >>>> Date: Wednesday, September 3, 2014 at 5:14 PM > >>>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com> > >>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go > >>>> over 3, 2K IOPS > >>>> > >>>> > >>>> Le 03/09/2014 22:11, Sebastien Han a écrit : > >>>>> Hi Warren, > >>>>> > >>>>> What do mean exactly by secure erase? At the firmware level with > >>>>> constructor softwares? > >>>>> SSDs were pretty new so I don't we hit that sort of things. I believe > >>>>> that only aged SSDs have this behaviour but I might be wrong. > >>>>> > >>>> Sorry I forgot to reply to the real question ;-) So yes it only > >>>> plays after some times, for your case, if the SSD still delivers write > >>>> IOPS specified by the manufacturer, it will doesn't help in any ways. > >>>> > >>>> But it seems this practice is nowadays increasingly used. > >>>> > >>>> Cheers > >>>>> On 02 Sep 2014, at 18:23, Wang, Warren > >>>>> <warren_w...@cable.comcast.com> > >>>>> wrote: > >>>>> > >>>>> > >>>>>> Hi Sebastien, > >>>>>> > >>>>>> Something I didn't see in the thread so far, did you secure erase the > >>>>>> SSDs before they got used? I assume these were probably repurposed for > >>>>>> this test. We have seen some pretty significant garbage collection > >>>>>> issue on various SSD and other forms of solid state storage to the > >>>>>> point where we are overprovisioning pretty much every solid state > >>>>>> device now. By as much as 50% to handle sustained write operations. > >>>>>> Especially important for the journals, as we've found. > >>>>>> > >>>>>> Maybe not an issue on the short fio run below, but certainly evident > >>>>>> on longer runs or lots of historical data on the drives. The max > >>>>>> transaction time looks pretty good for your test. Something to > >>>>>> consider though. > >>>>>> > >>>>>> Warren > >>>>>> > >>>>>> -----Original Message----- > >>>>>> From: ceph-users [ > >>>>>> mailto:ceph-users-boun...@lists.ceph.com > >>>>>> ] On Behalf Of Sebastien Han > >>>>>> Sent: Thursday, August 28, 2014 12:12 PM > >>>>>> To: ceph-users > >>>>>> Cc: Mark Nelson > >>>>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go > >>>>>> over 3, 2K IOPS > >>>>>> > >>>>>> Hey all, > >>>>>> > >>>>>> It has been a while since the last thread performance related on the > >>>>>> ML :p I've been running some experiment to see how much I can get from > >>>>>> an SSD on a Ceph cluster. > >>>>>> To achieve that I did something pretty simple: > >>>>>> > >>>>>> * Debian wheezy 7.6 > >>>>>> * kernel from debian 3.14-0.bpo.2-amd64 > >>>>>> * 1 cluster, 3 mons (i'd like to keep this realistic since in a > >>>>>> real deployment i'll use 3) > >>>>>> * 1 OSD backed by an SSD (journal and osd data on the same > >>>>>> device) > >>>>>> * 1 replica count of 1 > >>>>>> * partitions are perfectly aligned > >>>>>> * io scheduler is set to noon but deadline was showing the same > >>>>>> results > >>>>>> * no updatedb running > >>>>>> > >>>>>> About the box: > >>>>>> > >>>>>> * 32GB of RAM > >>>>>> * 12 cores with HT @ 2,4 GHz > >>>>>> * WB cache is enabled on the controller > >>>>>> * 10Gbps network (doesn't help here) > >>>>>> > >>>>>> The SSD is a 200G Intel DC S3700 and is capable of delivering around > >>>>>> 29K iops with random 4k writes (my fio results) As a benchmark tool I > >>>>>> used fio with the rbd engine (thanks deutsche telekom guys!). 
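As an aside: whether a given fio binary actually has the rbd engine compiled in (it needs to be built against librbd) can be checked on recent fio versions roughly like this:

# list the compiled-in ioengines; 'rbd' should appear in the list
fio --enghelp

# show the rbd engine's specific options (clientname, pool, rbdname, ...)
fio --enghelp=rbd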
> >>>>>> > >>>>>> O_DIECT and D_SYNC don't seem to be a problem for the SSD: > >>>>>> > >>>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536 > >>>>>> 65536+0 records in > >>>>>> 65536+0 records out > >>>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s > >>>>>> > >>>>>> # du -sh rand.file > >>>>>> 256M rand.file > >>>>>> > >>>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 > >>>>>> oflag=dsync,direct > >>>>>> 65536+0 records in > >>>>>> 65536+0 records out > >>>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s > >>>>>> > >>>>>> See my ceph.conf: > >>>>>> > >>>>>> [global] > >>>>>> auth cluster required = cephx > >>>>>> auth service required = cephx > >>>>>> auth client required = cephx > >>>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97 > >>>>>> osd pool default pg num = 4096 > >>>>>> osd pool default pgp num = 4096 > >>>>>> osd pool default size = 2 > >>>>>> osd crush chooseleaf type = 0 > >>>>>> > >>>>>> debug lockdep = 0/0 > >>>>>> debug context = 0/0 > >>>>>> debug crush = 0/0 > >>>>>> debug buffer = 0/0 > >>>>>> debug timer = 0/0 > >>>>>> debug journaler = 0/0 > >>>>>> debug osd = 0/0 > >>>>>> debug optracker = 0/0 > >>>>>> debug objclass = 0/0 > >>>>>> debug filestore = 0/0 > >>>>>> debug journal = 0/0 > >>>>>> debug ms = 0/0 > >>>>>> debug monc = 0/0 > >>>>>> debug tp = 0/0 > >>>>>> debug auth = 0/0 > >>>>>> debug finisher = 0/0 > >>>>>> debug heartbeatmap = 0/0 > >>>>>> debug perfcounter = 0/0 > >>>>>> debug asok = 0/0 > >>>>>> debug throttle = 0/0 > >>>>>> > >>>>>> [mon] > >>>>>> mon osd down out interval = 600 > >>>>>> mon osd min down reporters = 13 > >>>>>> [mon.ceph-01] > >>>>>> host = ceph-01 > >>>>>> mon addr = 172.20.20.171 > >>>>>> [mon.ceph-02] > >>>>>> host = ceph-02 > >>>>>> mon addr = 172.20.20.172 > >>>>>> [mon.ceph-03] > >>>>>> host = ceph-03 > >>>>>> mon addr = 172.20.20.173 > >>>>>> > >>>>>> debug lockdep = 0/0 > >>>>>> debug context = 0/0 > >>>>>> debug crush = 0/0 > >>>>>> debug buffer = 0/0 > >>>>>> debug timer = 0/0 > >>>>>> debug journaler = 0/0 > >>>>>> debug osd = 0/0 > >>>>>> debug optracker = 0/0 > >>>>>> debug objclass = 0/0 > >>>>>> debug filestore = 0/0 > >>>>>> debug journal = 0/0 > >>>>>> debug ms = 0/0 > >>>>>> debug monc = 0/0 > >>>>>> debug tp = 0/0 > >>>>>> debug auth = 0/0 > >>>>>> debug finisher = 0/0 > >>>>>> debug heartbeatmap = 0/0 > >>>>>> debug perfcounter = 0/0 > >>>>>> debug asok = 0/0 > >>>>>> debug throttle = 0/0 > >>>>>> > >>>>>> [osd] > >>>>>> osd mkfs type = xfs > >>>>>> osd mkfs options xfs = -f -i size=2048 osd mount options xfs = > >>>>>> rw,noatime,logbsize=256k,delaylog osd journal size = 20480 > >>>>>> cluster_network = 172.20.20.0/24 public_network = 172.20.20.0/24 > >>>>>> osd mon heartbeat interval = 30 # Performance tuning filestore > >>>>>> merge threshold = 40 filestore split multiple = 8 osd op threads > >>>>>> = 8 # Recovery tuning osd recovery max active = 1 osd max > >>>>>> backfills = 1 osd recovery op priority = 1 > >>>>>> > >>>>>> > >>>>>> debug lockdep = 0/0 > >>>>>> debug context = 0/0 > >>>>>> debug crush = 0/0 > >>>>>> debug buffer = 0/0 > >>>>>> debug timer = 0/0 > >>>>>> debug journaler = 0/0 > >>>>>> debug osd = 0/0 > >>>>>> debug optracker = 0/0 > >>>>>> debug objclass = 0/0 > >>>>>> debug filestore = 0/0 > >>>>>> debug journal = 0/0 > >>>>>> debug ms = 0/0 > >>>>>> debug monc = 0/0 > >>>>>> debug tp = 0/0 > >>>>>> debug auth = 0/0 > >>>>>> debug finisher = 0/0 > >>>>>> debug heartbeatmap = 0/0 > >>>>>> debug perfcounter = 0/0 > >>>>>> debug asok = 0/0 > >>>>>> debug throttle = 0/0 
> >>>>>> > >>>>>> Disabling all debugging made me win 200/300 more IOPS. > >>>>>> > >>>>>> See my fio template: > >>>>>> > >>>>>> [global] > >>>>>> #logging > >>>>>> #write_iops_log=write_iops_log > >>>>>> #write_bw_log=write_bw_log > >>>>>> #write_lat_log=write_lat_lo > >>>>>> > >>>>>> time_based > >>>>>> runtime=60 > >>>>>> > >>>>>> ioengine=rbd > >>>>>> clientname=admin > >>>>>> pool=test > >>>>>> rbdname=fio > >>>>>> invalidate=0 # mandatory > >>>>>> #rw=randwrite > >>>>>> rw=write > >>>>>> bs=4k > >>>>>> #bs=32m > >>>>>> size=5G > >>>>>> group_reporting > >>>>>> > >>>>>> [rbd_iodepth32] > >>>>>> iodepth=32 > >>>>>> direct=1 > >>>>>> > >>>>>> See my rio output: > >>>>>> > >>>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, > >>>>>> ioengine=rbd, iodepth=32 fio-2.1.11-14-gb74e Starting 1 process > >>>>>> rbd engine: RBD version: 0.1.8 > >>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] > >>>>>> [0/3219/0 iops] [eta 00m:00s] > >>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 > >>>>>> 00:28:26 2014 > >>>>>> write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec slat > >>>>>> (usec): min=42, max=1578, avg=66.50, stdev=16.96 clat (msec): > >>>>>> min=1, max=28, avg= 9.85, stdev= 1.48 lat (msec): min=1, max=28, > >>>>>> avg= 9.92, stdev= 1.47 clat percentiles (usec): > >>>>>> | 1.00th=[ 6368], 5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ > >>>>>> | 9152], 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], > >>>>>> | 60.00th=[10048], 70.00th=[10176], 80.00th=[10560], > >>>>>> | 90.00th=[10944], 95.00th=[11456], 99.00th=[13120], > >>>>>> | 99.50th=[16768], 99.90th=[25984], 99.95th=[27008], > >>>>>> | 99.99th=[28032] > >>>>>> bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, > >>>>>> stdev=407.35 lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, > >>>>>> 50=0.41% cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0, > >>>>>> minf=426088 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, > >>>>>> 32=66.1%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, > >>>>>> 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=99.6%, 8=0.4%, > >>>>>> 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0% issued : > >>>>>> total=r=0/w=192862/d=0, short=r=0/w=0/d=0 latency : target=0, > >>>>>> window=0, percentile=100.00%, depth=32 > >>>>>> > >>>>>> Run status group 0 (all jobs): > >>>>>> WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, > >>>>>> maxb=12855KB/s, mint=60010msec, maxt=60010msec > >>>>>> > >>>>>> Disk stats (read/write): > >>>>>> dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, > >>>>>> aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, > >>>>>> aggrutil=0.01% > >>>>>> sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01% > >>>>>> > >>>>>> I tried to tweak several parameters like: > >>>>>> > >>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000 > >>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000 > >>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000 > >>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000 filestore queue > >>>>>> max ops = 2000 > >>>>>> > >>>>>> But didn't any improvement. > >>>>>> > >>>>>> Then I tried other things: > >>>>>> > >>>>>> * Increasing the io_depth up to 256 or 512 gave me between 50 to 100 > >>>>>> more IOPS but it's not a realistic workload anymore and not that > >>>>>> significant. 
> >>>>>> * adding another SSD for the journal, still getting 3,2K IOPS > >>>>>> * I tried with rbd bench and I also got 3K IOPS > >>>>>> * I ran the test on a client machine and then locally on the > >>>>>> server, still getting 3,2K IOPS > >>>>>> * put the journal in memory, still getting 3,2K IOPS > >>>>>> * with 2 clients running the test in parallel I got a total of > >>>>>> 3,6K IOPS but I don't seem to be able to go over > >>>>>> * I tried is to add another OSD to that SSD, so I had 2 OSD and 2 > >>>>>> journals on 1 SSD, got 4,5K IOPS YAY! > >>>>>> > >>>>>> Given the results of the last time it seems that something is limiting > >>>>>> the number of IOPS per OSD process. > >>>>>> > >>>>>> Running the test on a client or locally didn't show any difference. > >>>>>> So it looks to me that there is some contention within Ceph that might > >>>>>> cause this. > >>>>>> > >>>>>> I also ran perf and looked at the output, everything looks decent, but > >>>>>> someone might want to have a look at it :). > >>>>>> > >>>>>> We have been able to reproduce this on 3 distinct platforms with some > >>>>>> deviations (because of the hardware) but the behaviour is the same. > >>>>>> Any thoughts will be highly appreciated, only getting 3,2k out of an > >>>>>> 29K IOPS SSD is a bit frustrating :). > >>>>>> > >>>>>> Cheers. > >>>>>> ---- > >>>>>> Sébastien Han > >>>>>> Cloud Architect > >>>>>> > >>>>>> "Always give 100%. Unless you're giving blood." > >>>>>> > >>>>>> Phone: +33 (0)1 49 70 99 72 > >>>>>> Mail: > >>>>>> sebastien....@enovance.com > >>>>>> > >>>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web : > >>>>>> www.enovance.com > >>>>>> - Twitter : @enovance > >>>>>> > >>>>>> > >>>>> Cheers. > >>>>> ---- > >>>>> Sébastien Han > >>>>> Cloud Architect > >>>>> > >>>>> "Always give 100%. Unless you're giving blood." > >>>>> > >>>>> Phone: +33 (0)1 49 70 99 72 > >>>>> Mail: > >>>>> sebastien....@enovance.com > >>>>> > >>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web : > >>>>> www.enovance.com > >>>>> - Twitter : @enovance > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> ceph-users mailing list > >>>>> > >>>>> ceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-u > >>>>> sers-ceph.com > >>>> -- > >>>> Cédric > >>>> > >>>> _______________________________________________ > >>>> ceph-users mailing list > >>>> ceph-users@lists.ceph.com > >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >>> Cheers. > >>> ---- > >>> Sébastien Han > >>> Cloud Architect > >>> > >>> "Always give 100%. Unless you're giving blood." 
> >>> > >>> Phone: +33 (0)1 49 70 99 72 > >>> Mail: sebastien....@enovance.com > >>> Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com > >>> - Twitter : @enovance > >>> > >>> > >>> _______________________________________________ > >>> ceph-users mailing list > >>> ceph-users@lists.ceph.com > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >>> _______________________________________________ > >>> ceph-users mailing list > >>> ceph-users@lists.ceph.com > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > > Cédric > > > > _______________________________________________ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > _______________________________________________ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > _______________________________________________ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > _______________________________________________ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > Cheers. > ---- > Sébastien Han > Cloud Architect > > "Always give 100%. Unless you're giving blood." > > Phone: +33 (0)1 49 70 99 72 > Mail: sebastien....@enovance.com > Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com - > Twitter : @enovance > _______________________________________________ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
FJ-20140915-Best-Practice_Distributed-Intelligent-Storage_NVMe-SSD_fast-IC_v8_versions,ksp.pdf