We haven't tried Giant yet... Thanks Jian
-----Original Message-----
From: Sebastien Han [mailto:sebastien....@enovance.com]
Sent: Tuesday, September 23, 2014 11:42 PM
To: Zhang, Jian
Cc: Alexandre DERUMIER; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS

What about writes with Giant?

On 18 Sep 2014, at 08:12, Zhang, Jian <jian.zh...@intel.com> wrote:

> Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
> We are able to get ~18K IOPS for 4K random read on a single volume with fio
> (with the rbd engine) on a 12x DC3700 setup, but only able to get ~23K (peak)
> IOPS even with multiple volumes.
> It seems the maximum random write performance we can get on the entire cluster
> is quite close to the single-volume performance.
>
> Thanks
> Jian
>
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han
> Sent: Tuesday, September 16, 2014 9:33 PM
> To: Alexandre DERUMIER
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>
> Hi,
>
> Thanks for keeping us updated on this subject.
> dsync is definitely killing the SSD.
>
> I don't have much to add; I'm just surprised that you're only getting 5299 IOPS
> with 0.85, since I've been able to get 6,4K. Then again, I was using the 200GB
> model, which might explain the difference.
>
>
> On 12 Sep 2014, at 16:32, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>
>> Here are the results for the Intel S3500.
>> ------------------------------------
>> Max performance is with ceph 0.85 + optracker disabled.
>> The Intel S3500 doesn't have the d_sync problem the Crucial has.
>>
>> %util shows almost 100% for read and write, so maybe the SSD itself is the limit.
>>
>> I have some STEC ZeusRAM 8GB in stock (I used them for ZFS ZIL); I'll try to
>> bench them next week.
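A side note on Jian's multi-volume numbers: a single fio run can drive several RBD images at once, one job section per image, so the aggregate IOPS is measured in one shot. This is only a sketch; the pool name "test" and the volume names vol1/vol2 are hypothetical and must exist beforehand (e.g. created with "rbd create vol1 --pool test --size 10240").

cat > multi-rbd.fio <<'EOF'
[global]
# rbd engine options as used elsewhere in this thread
ioengine=rbd
clientname=admin
pool=test
rw=randwrite
bs=4k
iodepth=32
direct=1
invalidate=0    # mandatory for the rbd engine
group_reporting=1

[vol1]
rbdname=vol1

[vol2]
rbdname=vol2
EOF

fio multi-rbd.fio

With group_reporting, fio prints the combined IOPS across both volumes, which is the number to compare against the single-volume result.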
>>
>> INTEL S3500
>> -----------
>>
>> raw disk
>> --------
>>
>> randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
>>   --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>> bw=288207KB/s, iops=72051
>>
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0,00 0,00 73454,00 0,00 293816,00 0,00 8,00 30,96 0,42 0,42 0,00 0,01 99,90
>>
>> randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
>>   --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio --sync=1
>> bw=48131KB/s, iops=12032
>>
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0,00 0,00 0,00 24120,00 0,00 48240,00 4,00 2,08 0,09 0,00 0,09 0,04 100,00
>>
>>
>> ceph 0.80
>> ---------
>> randread (no tuning): bw=24578KB/s, iops=6144
>>
>> randwrite: bw=10358KB/s, iops=2589
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0,00 373,00 0,00 8878,00 0,00 34012,50 7,66 1,63 0,18 0,00 0,18 0,06 50,90
>>
>>
>> ceph 0.85
>> ---------
>> randread: bw=41406KB/s, iops=10351
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 2,00 0,00 10425,00 0,00 41816,00 0,00 8,02 1,36 0,13 0,13 0,00 0,07 75,90
>>
>> randwrite: bw=17204KB/s, iops=4301
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0,00 333,00 0,00 9788,00 0,00 57909,00 11,83 1,46 0,15 0,00 0,15 0,07 67,80
>>
>>
>> ceph 0.85 + tuning op_tracker=false
>> -----------------------------------
>> randread: bw=86537KB/s, iops=21634
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 25,00 0,00 21428,00 0,00 86444,00 0,00 8,07 3,13 0,15 0,15 0,00 0,05 98,00
>>
>> randwrite: bw=21199KB/s, iops=5299
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0,00 1563,00 0,00 9880,00 0,00 75223,50 15,23 2,09 0,21 0,00 0,21 0,07 80,00
>>
>>
>> ----- Original Message -----
>>
>> From: "Alexandre DERUMIER" <aderum...@odiso.com>
>> To: "Cedric Lemarchand" <ced...@yipikai.org>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Friday, September 12, 2014 08:15:08
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>
>> Results of fio on rbd with the kernel patch:
>>
>> fio rbd, Crucial M550, 1 OSD, 0.85 (osd_enable_op_tracker true or false, same result):
>> ---------------------------
>> bw=12327KB/s, iops=3081
>>
>> So not much better than before, but this time iostat shows only 15% util,
>> and latencies are lower:
>>
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
>>
>> So the write bottleneck seems to be in Ceph.
>>
>> I will send the S3500 results today.
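Since disabling the op tracker makes such a difference in the numbers above, it can be worth flipping it without editing ceph.conf; a minimal sketch, assuming osd.0 (on some releases the change may only take effect after an OSD restart):

# try to disable the op tracker on one OSD at runtime
ceph tell osd.0 injectargs '--osd_enable_op_tracker=false'

# to make it permanent, set it under [osd] in ceph.conf and restart the OSD:
#   osd_enable_op_tracker = false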
>>
>> ----- Original Message -----
>>
>> From: "Alexandre DERUMIER" <aderum...@odiso.com>
>> To: "Cedric Lemarchand" <ced...@yipikai.org>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Friday, September 12, 2014 07:58:05
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>
>>>> For the Crucial, I'll try to apply the patch from Stefan Priebe to
>>>> ignore flushes (as the Crucial M550 has supercaps):
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
>>
>> Here are the results with cache flushes disabled:
>>
>> crucial m550
>> ------------
>> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
>>   --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=177575KB/s, iops=44393
>>
>>
>> ----- Original Message -----
>>
>> From: "Alexandre DERUMIER" <aderum...@odiso.com>
>> To: "Cedric Lemarchand" <ced...@yipikai.org>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Friday, September 12, 2014 04:55:21
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>
>> Hi,
>> it seems the Intel S3500 performs a lot better with O_DSYNC.
>>
>> crucial m550
>> ------------
>> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
>>   --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=1249.9KB/s, iops=312
>>
>> intel s3500
>> -----------
>> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
>>   --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=41794KB/s, iops=10448
>>
>> OK, so 30x faster.
>>
>>
>> For the Crucial, I have tried to apply the patch from Stefan Priebe to
>> ignore flushes (as the Crucial M550 has supercaps):
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
>> Coming from ZFS, this sounds like "zfs_nocacheflush".
>>
>> Now the results:
>>
>> crucial m550
>> ------------
>> #fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
>>   --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=177575KB/s, iops=44393
>>
>>
>> fio rbd, Crucial M550, 1 OSD, 0.85 (osd_enable_op_tracker true or false, same result):
>> ---------------------------
>> bw=12327KB/s, iops=3081
>>
>> So not much better than before, but this time iostat shows only 15% util,
>> and latencies are lower:
>>
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0,00 29,00 0,00 3075,00 0,00 36748,50 23,90 0,29 0,10 0,00 0,10 0,05 15,20
>>
>> So the write bottleneck seems to be in Ceph.
>>
>> I will send the S3500 results today.
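The O_DSYNC comparison above is easy to repeat across several candidate journal SSDs with a small wrapper; a sketch reusing the same fio flags (the device list is a placeholder, and the test destroys any data on those devices):

#!/bin/sh
# DESTRUCTIVE: writes directly to the raw devices listed below.
for dev in /dev/sdb /dev/sdc; do
    echo "=== $dev ==="
    fio --filename=$dev --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=2 --runtime=60 --time_based \
        --group_reporting --invalidate=0 --name=dsync-test
done

A drive that collapses on this test (like the M550 above) will also struggle as a FileStore journal, since the journal is written with small O_DIRECT/O_DSYNC writes.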
>>
>> ----- Original Message -----
>>
>> From: "Cedric Lemarchand" <ced...@yipikai.org>
>> To: ceph-users@lists.ceph.com
>> Sent: Thursday, September 11, 2014 21:23:23
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>
>>
>> On 11/09/2014 19:33, Cedric Lemarchand wrote:
>>> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>>>> Hi Sebastien,
>>>>
>>>> Here are my first results with the Crucial M550 (I'll send the Intel S3500
>>>> results later):
>>>>
>>>> - 3 nodes
>>>> - Dell R620 without expander backplane
>>>> - SAS controller: LSI 9207 (no hardware RAID or cache)
>>>> - 2 x E5-2603v2 1.8GHz (4 cores)
>>>> - 32GB RAM
>>>> - network: 2x gigabit LACP + 2x gigabit LACP for cluster replication
>>>> - OS: Debian wheezy, with kernel 3.10
>>>>
>>>> os + ceph mon: 2x Intel S3500 100GB, Linux soft RAID
>>>> osd: Crucial M550 (1TB)
>>>>
>>>> 3 mons in the ceph cluster,
>>>> and 1 osd (journal and data on the same disk).
>>>>
>>>>
>>>> ceph.conf
>>>> ---------
>>>> debug_lockdep = 0/0
>>>> debug_context = 0/0
>>>> debug_crush = 0/0
>>>> debug_buffer = 0/0
>>>> debug_timer = 0/0
>>>> debug_filer = 0/0
>>>> debug_objecter = 0/0
>>>> debug_rados = 0/0
>>>> debug_rbd = 0/0
>>>> debug_journaler = 0/0
>>>> debug_objectcatcher = 0/0
>>>> debug_client = 0/0
>>>> debug_osd = 0/0
>>>> debug_optracker = 0/0
>>>> debug_objclass = 0/0
>>>> debug_filestore = 0/0
>>>> debug_journal = 0/0
>>>> debug_ms = 0/0
>>>> debug_monc = 0/0
>>>> debug_tp = 0/0
>>>> debug_auth = 0/0
>>>> debug_finisher = 0/0
>>>> debug_heartbeatmap = 0/0
>>>> debug_perfcounter = 0/0
>>>> debug_asok = 0/0
>>>> debug_throttle = 0/0
>>>> debug_mon = 0/0
>>>> debug_paxos = 0/0
>>>> debug_rgw = 0/0
>>>>
>>>> osd_op_threads = 5
>>>> filestore_op_threads = 4
>>>>
>>>> ms_nocrc = true
>>>> cephx sign messages = false
>>>> cephx require signatures = false
>>>>
>>>> ms_dispatch_throttle_bytes = 0
>>>>
>>>> #0.85
>>>> throttler_perf_counter = false
>>>> filestore_fd_cache_size = 64
>>>> filestore_fd_cache_shards = 32
>>>> osd_op_num_threads_per_shard = 1
>>>> osd_op_num_shards = 25
>>>> osd_enable_op_tracker = true
>>>>
>>>>
>>>> Fio disk 4K benchmark
>>>> ---------------------
>>>> rand read 4k: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
>>>>   --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>>>> bw=271755KB/s, iops=67938
>>>>
>>>> rand write 4k: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
>>>>   --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>>>> bw=228293KB/s, iops=57073
>>>>
>>>>
>>>> fio osd benchmark (through librbd)
>>>> ----------------------------------
>>>> [global]
>>>> ioengine=rbd
>>>> clientname=admin
>>>> pool=test
>>>> rbdname=test
>>>> invalidate=0 # mandatory
>>>> rw=randwrite
>>>> rw=randread
>>>> bs=4k
>>>> direct=1
>>>> numjobs=4
>>>> group_reporting=1
>>>>
>>>> [rbd_iodepth32]
>>>> iodepth=32
>>>>
>>>>
>>>> FIREFLY RESULTS
>>>> ---------------
>>>> fio randwrite: bw=5009.6KB/s, iops=1252
>>>> fio randread:  bw=37820KB/s, iops=9455
>>>>
>>>>
>>>> 0.85 RESULTS
>>>> ------------
>>>> fio randwrite: bw=11658KB/s, iops=2914
>>>> fio randread:  bw=38642KB/s, iops=9660
>>>>
>>>>
>>>> 0.85 + osd_enable_op_tracker=false
>>>> ----------------------------------
>>>> fio randwrite: bw=11630KB/s, iops=2907
>>>> fio randread:  bw=80606KB/s, iops=20151 (cpu 100% - GREAT!)
>>>>
>>>>
>>>> So, for reads, it seems osd_enable_op_tracker is the bottleneck.
>>>>
>>>> Now for writes, I really don't understand why it's so low.
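One way to see where an individual write spends its time inside the OSD is the admin socket; a minimal sketch, assuming osd.0 and the default socket path (dump_historic_ops is only populated while the op tracker is enabled):

# per-OSD performance counters (journal, filestore and op-queue latencies)
ceph daemon osd.0 perf dump

# slowest recent ops with per-stage timestamps
# (only available while osd_enable_op_tracker = true)
ceph daemon osd.0 dump_historic_ops

# same thing via the socket file directly
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump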
>>>>
>>>> I have done some iostat:
>>>>
>>>>
>>>> FIO directly on /dev/sdb
>>>> bw=228293KB/s, iops=57073
>>>>
>>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>> sdb 0,00 0,00 0,00 63613,00 0,00 254452,00 8,00 31,24 0,49 0,00 0,49 0,02 100,00
>>>>
>>>>
>>>> FIO on the osd through librbd
>>>> bw=11658KB/s, iops=2914
>>>>
>>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>> sdb 0,00 355,00 0,00 5225,00 0,00 29678,00 11,36 57,63 11,03 0,00 11,03 0,19 99,70
>>>>
>>>>
>>>> (I don't understand exactly what %util means here: it is 100% in both cases,
>>>> yet it's 10x slower with ceph.)
>>>
>>> It would be interesting if you could catch the size of the writes hitting the
>>> SSD during the bench through librbd (I know nmon can do that).
>>
>> Replying to myself... I asked a bit quickly, since we already have
>> this information (29678 / 5225 = 5,68 KB), but this is irrelevant.
>>
>> Cheers
>>
>>>> It could be a dsync problem; the result seems pretty poor:
>>>>
>>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
>>>> 65536+0 records in
>>>> 65536+0 records out
>>>> 268435456 bytes (268 MB) copied, 2,77433 s, 96,8 MB/s
>>>>
>>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
>>>> ^C17228+0 records in
>>>> 17228+0 records out
>>>> 70565888 bytes (71 MB) copied, 70,4098 s, 1,0 MB/s
>>>>
>>>>
>>>> I'll do tests with the Intel S3500 tomorrow to compare.
>>>>
>>>> ----- Original Message -----
>>>>
>>>> From: "Sebastien Han" <sebastien....@enovance.com>
>>>> To: "Warren Wang" <warren_w...@cable.comcast.com>
>>>> Cc: ceph-users@lists.ceph.com
>>>> Sent: Monday, September 8, 2014 22:58:25
>>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>
>>>> They definitely are, Warren!
>>>>
>>>> Thanks for bringing this up here :).
>>>>
>>>> On 05 Sep 2014, at 23:02, Wang, Warren <warren_w...@cable.comcast.com> wrote:
>>>>
>>>>> +1 to what Cedric said.
>>>>>
>>>>> Anything more than a few minutes of heavy sustained writes tended to get
>>>>> our solid state devices into a state where garbage collection could not
>>>>> keep up. Originally we used small SSDs and did not overprovision the
>>>>> journals by much. Manufacturers publish their SSD stats, and then, in very
>>>>> small font, state that the attained IOPS are with empty drives and that the
>>>>> tests are only run for very short amounts of time. Even if the drives are
>>>>> new, it's a good idea to perform an hdparm secure erase on them (so that
>>>>> the SSD knows that the blocks are truly unused), and then overprovision
>>>>> them. You'll know if you have a problem by watching for utilization and
>>>>> wait data on the journals.
>>>>>
>>>>> One of the other interesting performance issues is that the Intel 10GbE
>>>>> NICs + default kernel that we typically use max out around 1 million
>>>>> packets/sec. It's worth tracking this metric to see if you are close.
>>>>>
>>>>> I know these aren't necessarily relevant to the test parameters you gave
>>>>> below, but they're worth keeping in mind.
>>>>>
>>>>> --
>>>>> Warren Wang
>>>>> Comcast Cloud (OpenStack)
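A hedged sketch of the secure-erase-plus-overprovision routine Warren describes, for a drive with no data you care about (/dev/sdX and the sector count are placeholders, and the drive must not be reported as "frozen" by hdparm -I, which sometimes requires a suspend/resume or hot-replug first):

# check the ATA security state first
hdparm -I /dev/sdX | grep -A8 Security

# ATA secure erase: set a temporary password, then erase
hdparm --user-master u --security-set-pass p /dev/sdX
hdparm --user-master u --security-erase p /dev/sdX

# overprovision afterwards, either by partitioning only ~80% of the
# drive and leaving the rest untouched, or by shrinking the visible
# capacity with an HPA (the sector count here is just an example):
hdparm -N p312581808 --yes-i-know-what-i-am-doing /dev/sdX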
>>>>>
>>>>> From: Cedric Lemarchand <ced...@yipikai.org>
>>>>> Date: Wednesday, September 3, 2014 at 5:14 PM
>>>>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>>>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>>
>>>>>
>>>>> On 03/09/2014 22:11, Sebastien Han wrote:
>>>>>> Hi Warren,
>>>>>>
>>>>>> What do you mean exactly by secure erase? At the firmware level, with the
>>>>>> manufacturer's tools?
>>>>>> The SSDs were pretty new, so I don't think we hit that sort of thing. I believe
>>>>>> that only aged SSDs show this behaviour, but I might be wrong.
>>>>>>
>>>>> Sorry, I forgot to reply to the real question ;-) So yes, it only comes into
>>>>> play after some time; in your case, if the SSD still delivers the write IOPS
>>>>> specified by the manufacturer, it won't help in any way.
>>>>>
>>>>> But it seems this practice is nowadays increasingly used.
>>>>>
>>>>> Cheers
>>>>>> On 02 Sep 2014, at 18:23, Wang, Warren <warren_w...@cable.comcast.com> wrote:
>>>>>>
>>>>>>
>>>>>>> Hi Sebastien,
>>>>>>>
>>>>>>> Something I didn't see in the thread so far: did you secure erase the
>>>>>>> SSDs before they got used? I assume these were probably repurposed for
>>>>>>> this test. We have seen some pretty significant garbage collection
>>>>>>> issues on various SSDs and other forms of solid state storage, to the
>>>>>>> point where we are overprovisioning pretty much every solid state
>>>>>>> device now, by as much as 50%, to handle sustained write operations.
>>>>>>> Especially important for the journals, as we've found.
>>>>>>>
>>>>>>> Maybe not an issue on the short fio run below, but certainly evident on
>>>>>>> longer runs or with lots of historical data on the drives. The max
>>>>>>> transaction time looks pretty good for your test. Something to consider
>>>>>>> though.
>>>>>>>
>>>>>>> Warren
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sebastien Han
>>>>>>> Sent: Thursday, August 28, 2014 12:12 PM
>>>>>>> To: ceph-users
>>>>>>> Cc: Mark Nelson
>>>>>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go over 3,2K IOPS
>>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>> It has been a while since the last performance-related thread on the ML :p
>>>>>>> I've been running some experiments to see how much I can get from an
>>>>>>> SSD on a Ceph cluster.
>>>>>>> To achieve that I did something pretty simple:
>>>>>>>
>>>>>>> * Debian wheezy 7.6
>>>>>>> * kernel from debian 3.14-0.bpo.2-amd64
>>>>>>> * 1 cluster, 3 mons (I'd like to keep this realistic since in a
>>>>>>>   real deployment I'll use 3)
>>>>>>> * 1 OSD backed by an SSD (journal and osd data on the same device)
>>>>>>> * replica count of 1
>>>>>>> * partitions are perfectly aligned
>>>>>>> * io scheduler is set to noop, but deadline was showing the same results
>>>>>>> * no updatedb running
>>>>>>>
>>>>>>> About the box:
>>>>>>>
>>>>>>> * 32GB of RAM
>>>>>>> * 12 cores with HT @ 2,4 GHz
>>>>>>> * WB cache is enabled on the controller
>>>>>>> * 10Gbps network (doesn't help here)
>>>>>>>
>>>>>>> The SSD is a 200G Intel DC S3700 and is capable of delivering around
>>>>>>> 29K IOPS with random 4k writes (my fio results). As a benchmark tool I
>>>>>>> used fio with the rbd engine (thanks Deutsche Telekom guys!).
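For completeness, the scheduler setting from the test description above can be checked and switched per device through sysfs; a minimal sketch with sdb as a placeholder (the value does not survive a reboot unless set via a udev rule or boot script):

# show the current scheduler; the active one is shown in brackets
cat /sys/block/sdb/queue/scheduler

# switch to noop for the benchmark run
echo noop > /sys/block/sdb/queue/scheduler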
>>>>>>>
>>>>>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>>>>>>
>>>>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536
>>>>>>> 65536+0 records in
>>>>>>> 65536+0 records out
>>>>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s
>>>>>>>
>>>>>>> # du -sh rand.file
>>>>>>> 256M rand.file
>>>>>>>
>>>>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 oflag=dsync,direct
>>>>>>> 65536+0 records in
>>>>>>> 65536+0 records out
>>>>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s
>>>>>>>
>>>>>>> See my ceph.conf:
>>>>>>>
>>>>>>> [global]
>>>>>>> auth cluster required = cephx
>>>>>>> auth service required = cephx
>>>>>>> auth client required = cephx
>>>>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97
>>>>>>> osd pool default pg num = 4096
>>>>>>> osd pool default pgp num = 4096
>>>>>>> osd pool default size = 2
>>>>>>> osd crush chooseleaf type = 0
>>>>>>>
>>>>>>> debug lockdep = 0/0
>>>>>>> debug context = 0/0
>>>>>>> debug crush = 0/0
>>>>>>> debug buffer = 0/0
>>>>>>> debug timer = 0/0
>>>>>>> debug journaler = 0/0
>>>>>>> debug osd = 0/0
>>>>>>> debug optracker = 0/0
>>>>>>> debug objclass = 0/0
>>>>>>> debug filestore = 0/0
>>>>>>> debug journal = 0/0
>>>>>>> debug ms = 0/0
>>>>>>> debug monc = 0/0
>>>>>>> debug tp = 0/0
>>>>>>> debug auth = 0/0
>>>>>>> debug finisher = 0/0
>>>>>>> debug heartbeatmap = 0/0
>>>>>>> debug perfcounter = 0/0
>>>>>>> debug asok = 0/0
>>>>>>> debug throttle = 0/0
>>>>>>>
>>>>>>> [mon]
>>>>>>> mon osd down out interval = 600
>>>>>>> mon osd min down reporters = 13
>>>>>>> [mon.ceph-01]
>>>>>>> host = ceph-01
>>>>>>> mon addr = 172.20.20.171
>>>>>>> [mon.ceph-02]
>>>>>>> host = ceph-02
>>>>>>> mon addr = 172.20.20.172
>>>>>>> [mon.ceph-03]
>>>>>>> host = ceph-03
>>>>>>> mon addr = 172.20.20.173
>>>>>>>
>>>>>>> [... the same debug = 0/0 block as above is repeated here ...]
>>>>>>>
>>>>>>> [osd]
>>>>>>> osd mkfs type = xfs
>>>>>>> osd mkfs options xfs = -f -i size=2048
>>>>>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>>>>>> osd journal size = 20480
>>>>>>> cluster_network = 172.20.20.0/24
>>>>>>> public_network = 172.20.20.0/24
>>>>>>> osd mon heartbeat interval = 30
>>>>>>> # Performance tuning
>>>>>>> filestore merge threshold = 40
>>>>>>> filestore split multiple = 8
>>>>>>> osd op threads = 8
>>>>>>> # Recovery tuning
>>>>>>> osd recovery max active = 1
>>>>>>> osd max backfills = 1
>>>>>>> osd recovery op priority = 1
>>>>>>>
>>>>>>> [... the same debug = 0/0 block as above is repeated here ...]
>>>>>>>
>>>>>>> Disabling all debugging gained me another 200-300 IOPS.
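A quick way to confirm that settings such as the debug levels above were actually picked up by the running daemon is the admin socket; a small sketch assuming osd.0:

# dump the configuration the OSD process is actually running with
ceph daemon osd.0 config show | grep debug_osd

# or query a single option
ceph daemon osd.0 config get debug_osd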
>>>>>>>
>>>>>>> See my fio template:
>>>>>>>
>>>>>>> [global]
>>>>>>> #logging
>>>>>>> #write_iops_log=write_iops_log
>>>>>>> #write_bw_log=write_bw_log
>>>>>>> #write_lat_log=write_lat_lo
>>>>>>>
>>>>>>> time_based
>>>>>>> runtime=60
>>>>>>>
>>>>>>> ioengine=rbd
>>>>>>> clientname=admin
>>>>>>> pool=test
>>>>>>> rbdname=fio
>>>>>>> invalidate=0 # mandatory
>>>>>>> #rw=randwrite
>>>>>>> rw=write
>>>>>>> bs=4k
>>>>>>> #bs=32m
>>>>>>> size=5G
>>>>>>> group_reporting
>>>>>>>
>>>>>>> [rbd_iodepth32]
>>>>>>> iodepth=32
>>>>>>> direct=1
>>>>>>>
>>>>>>> See my fio output:
>>>>>>>
>>>>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>>> fio-2.1.11-14-gb74e
>>>>>>> Starting 1 process
>>>>>>> rbd engine: RBD version: 0.1.8
>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>>>>>>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>>>>>>   slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>>>>>>   clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>>>>>>   lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>>>>>>   clat percentiles (usec):
>>>>>>>   |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>>>>>>   | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>>>>>>   | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>>>>>>   | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>>>>>>   | 99.99th=[28032]
>>>>>>>   bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>>>>>>   lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>>>>>>   cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>>>>>>   submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>>   complete  : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>>>>   issued    : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>>>>>>   latency   : target=0, window=0, percentile=100.00%, depth=32
>>>>>>>
>>>>>>> Run status group 0 (all jobs):
>>>>>>>   WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, maxb=12855KB/s,
>>>>>>>          mint=60010msec, maxt=60010msec
>>>>>>>
>>>>>>> Disk stats (read/write):
>>>>>>>   dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%,
>>>>>>>         aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, aggrutil=0.01%
>>>>>>>   sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01%
>>>>>>>
>>>>>>> I tried to tweak several parameters like:
>>>>>>>
>>>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>>>>>>> filestore queue max ops = 2000
>>>>>>>
>>>>>>> But that didn't bring any improvement.
>>>>>>>
>>>>>>> Then I tried other things:
>>>>>>>
>>>>>>> * Increasing the io_depth up to 256 or 512 gave me between 50 and 100 more
>>>>>>>   IOPS, but it's not a realistic workload anymore and not that significant.
>>>>>>> * adding another SSD for the journal, still getting 3,2K IOPS
>>>>>>> * I tried with rbd bench and I also got 3K IOPS
>>>>>>> * I ran the test on a client machine and then locally on the server,
>>>>>>>   still getting 3,2K IOPS
>>>>>>> * put the journal in memory, still getting 3,2K IOPS
>>>>>>> * with 2 clients running the test in parallel I got a total of 3,6K IOPS,
>>>>>>>   but I don't seem to be able to go over that
>>>>>>> * I tried to add another OSD to that SSD, so I had 2 OSDs and 2 journals
>>>>>>>   on 1 SSD, and got 4,5K IOPS, YAY!
>>>>>>>
>>>>>>> Given the results of this last test, it seems that something is limiting
>>>>>>> the number of IOPS per OSD process.
>>>>>>>
>>>>>>> Running the test on a client or locally didn't show any difference.
>>>>>>> So it looks to me that there is some contention within Ceph that might
>>>>>>> cause this.
>>>>>>>
>>>>>>> I also ran perf and looked at the output; everything looks decent, but
>>>>>>> someone might want to have a look at it :).
>>>>>>>
>>>>>>> We have been able to reproduce this on 3 distinct platforms with some
>>>>>>> deviations (because of the hardware), but the behaviour is the same.
>>>>>>> Any thoughts would be highly appreciated; only getting 3,2K out of a
>>>>>>> 29K IOPS SSD is a bit frustrating :).
>>>>>>>
>>>>>>> Cheers.
>>>>>>> ----
>>>>>>> Sébastien Han
>>>>>>> Cloud Architect
>>>>>>>
>>>>>>> "Always give 100%. Unless you're giving blood."
>>>>>>>
>>>>>>> Phone: +33 (0)1 49 70 99 72
>>>>>>> Mail: sebastien....@enovance.com
>>>>>>> Address: 11 bis, rue Roquépine - 75008 Paris
>>>>>>> Web: www.enovance.com - Twitter: @enovance
>>>>>
>>>>> --
>>>>> Cédric
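For the perf run mentioned above, a minimal sketch of profiling a single ceph-osd process while the fio job is running (assumes perf from linux-tools is installed and that only one OSD runs on the host, so pidof returns a single pid):

# sample the OSD with call graphs for 30 seconds while fio is running
perf record -g -p $(pidof ceph-osd) -- sleep 30
perf report --stdio | head -50

# quick live view as an alternative
perf top -p $(pidof ceph-osd)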
Cheers.
----
Sébastien Han
Cloud Architect

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien....@enovance.com
Address: 11 bis, rue Roquépine - 75008 Paris
Web: www.enovance.com - Twitter: @enovance

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com