>> What about writes with Giant?

I'm getting around:
- 4K iops (4k random) with 1 osd (1 node - 1 osd)
- 8K iops (4k random) with 2 osds (1 node - 2 osds)
- 16K iops (4k random) with 4 osds (2 nodes - 2 osds per node)
- 22K iops (4k random) with 6 osds (3 nodes - 2 osds per node)

It seems to scale, but I'm CPU-bound on each node (the 8-core E5-2603 v2 @
1.80GHz sits at 100% CPU for 2 osds).
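Worked out per OSD, those figures show where the scaling flattens (a quick shell sanity check using the approximate iops numbers above):

```shell
# Per-OSD iops at each scale point, using the approximate figures above.
# Near-linear up to 4 OSDs, then it tails off as the CPUs saturate.
for entry in 1:4000 2:8000 4:16000 6:22000; do
  osds=${entry%%:*}
  iops=${entry##*:}
  echo "$osds osd(s): $((iops / osds)) iops per osd"
done
```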

----- Original Message -----

From: "Sebastien Han" <sebastien....@enovance.com>
To: "Jian Zhang" <jian.zh...@intel.com>
Cc: "Alexandre DERUMIER" <aderum...@odiso.com>, ceph-users@lists.ceph.com
Sent: Tuesday, September 23, 2014 17:41:38
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K
IOPS

What about writes with Giant? 

On 18 Sep 2014, at 08:12, Zhang, Jian <jian.zh...@intel.com> wrote: 

> Has anyone ever tested multi-volume performance on a *FULL* SSD setup?
> We are able to get ~18K IOPS for 4K random read on a single volume with fio
> (with the rbd engine) on a 12x DC3700 setup, but only able to get ~23K (peak)
> IOPS even with multiple volumes.
> It seems the maximum random write performance we can get on the entire
> cluster is quite close to the single-volume performance.
> 
> Thanks 
> Jian 
> 
> 
> -----Original Message----- 
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Sebastien Han 
> Sent: Tuesday, September 16, 2014 9:33 PM 
> To: Alexandre DERUMIER 
> Cc: ceph-users@lists.ceph.com 
> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3.2K
> IOPS 
> 
> Hi, 
> 
> Thanks for keeping us updated on this subject.
> dsync is definitely killing the SSD.
>
> I don't have much to add; I'm just surprised that you're only getting 5299
> with 0.85, since I've been able to get 6.4K. Well, I was using the 200GB
> model, which might explain this.
> 
> 
> On 12 Sep 2014, at 16:32, Alexandre DERUMIER <aderum...@odiso.com> wrote: 
> 
>> here are the results for the intel s3500
>> ----------------------------------------
>> max performance is with ceph 0.85 + optracker disabled.
>> the intel s3500 doesn't have the d_sync problem the crucial has
>>
>> %util shows almost 100% for read and write, so maybe the ssd disk
>> performance is the limit.
>>
>> I have some stec zeusram 8GB drives in stock (I used them for the zfs zil);
>> I'll try to bench them next week.
>> 
>> 
>> 
>> 
>> 
>> 
>> INTEL s3500 
>> ----------- 
>> raw disk 
>> -------- 
>> 
>> randread: fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
>> bw=288207KB/s, iops=72051
>> 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0.00 0.00 73454.00 0.00 293816.00 0.00 8.00 30.96 0.42 0.42 0.00 0.01 99.90
>> 
>> randwrite: fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k
>> --iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio --sync=1
>> bw=48131KB/s, iops=12032
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0.00 0.00 0.00 24120.00 0.00 48240.00 4.00 2.08 0.09 0.00 0.09 0.04 100.00
>> 
>> 
>> ceph 0.80 
>> --------- 
>> randread: no tuning: bw=24578KB/s, iops=6144 
>> 
>> 
>> randwrite: bw=10358KB/s, iops=2589 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0.00 373.00 0.00 8878.00 0.00 34012.50 7.66 1.63 0.18 0.00 0.18 0.06 50.90
>> 
>> 
>> ceph 0.85 : 
>> --------- 
>> 
>> randread : bw=41406KB/s, iops=10351 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 2.00 0.00 10425.00 0.00 41816.00 0.00 8.02 1.36 0.13 0.13 0.00 0.07 75.90
>> 
>> randwrite : bw=17204KB/s, iops=4301 
>> 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0.00 333.00 0.00 9788.00 0.00 57909.00 11.83 1.46 0.15 0.00 0.15 0.07 67.80
>> 
>> 
>> ceph 0.85 tuning op_tracker=false 
>> ---------------- 
>> 
>> randread : bw=86537KB/s, iops=21634 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 25.00 0.00 21428.00 0.00 86444.00 0.00 8.07 3.13 0.15 0.15 0.00 0.05 98.00
>> 
>> randwrite: bw=21199KB/s, iops=5299 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0.00 1563.00 0.00 9880.00 0.00 75223.50 15.23 2.09 0.21 0.00 0.21 0.07 80.00
>> 
>> 
>> ----- Original Message -----
>>
>> From: "Alexandre DERUMIER" <aderum...@odiso.com>
>> To: "Cedric Lemarchand" <ced...@yipikai.org>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Friday, September 12, 2014 08:15:08
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
>> 3.2K IOPS
>> 
>> results of fio on rbd with kernel patch 
>> 
>> 
>> 
>> fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
>> result): 
>> --------------------------- 
>> bw=12327KB/s, iops=3081 
>> 
>> So not much better than before, but this time iostat shows only 15%
>> util, and latencies are lower
>> 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0.00 29.00 0.00 3075.00 0.00 36748.50 23.90 0.29 0.10 0.00 0.10 0.05 15.20
>> 
>> 
>> So, the write bottleneck seems to be in ceph.
>> 
>> 
>> 
>> I will send the s3500 results today
>> 
>> ----- Original Message -----
>>
>> From: "Alexandre DERUMIER" <aderum...@odiso.com>
>> To: "Cedric Lemarchand" <ced...@yipikai.org>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Friday, September 12, 2014 07:58:05
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
>> 3.2K IOPS
>> 
>>>> For the crucial, I'll try to apply the patch from stefan priebe to
>>>> ignore flushes (as the crucial m550 has supercaps)
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
>> Here are the results with the cache flush disabled:
>> 
>> crucial m550 
>> ------------ 
>> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
>> --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=177575KB/s, iops=44393
>> 
>> 
>> ----- Original Message -----
>>
>> From: "Alexandre DERUMIER" <aderum...@odiso.com>
>> To: "Cedric Lemarchand" <ced...@yipikai.org>
>> Cc: ceph-users@lists.ceph.com
>> Sent: Friday, September 12, 2014 04:55:21
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
>> 3.2K IOPS
>> 
>> Hi, 
>> it seems that the intel s3500 performs a lot better with o_dsync
>> 
>> crucial m550 
>> ------------ 
>> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
>> --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=1249.9KB/s, iops=312
>> 
>> intel s3500 
>> ----------- 
>> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
>> --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=41794KB/s, iops=10448
>> 
>> ok, so 30x faster. 
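That per-write sync cost is easy to ballpark with dd alone; a minimal sketch, writing to a scratch file rather than a raw device, so it measures the filesystem + device sync path and only gives an order-of-magnitude figure:

```shell
# Time N 4k O_DSYNC writes and report the average per-write latency.
# Uses a scratch file, not a raw device, so the absolute number is
# only a ballpark, not the bare-SSD figure shown above.
f=./dsync_probe.$$
count=200
start=$(date +%s%N)
dd if=/dev/zero of="$f" bs=4k count=$count oflag=dsync 2>/dev/null
end=$(date +%s%N)
rm -f "$f"
echo "avg $(( (end - start) / count / 1000 )) us per 4k dsync write"
```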
>> 
>> 
>> 
>> For the crucial, I have tried to apply the patch from stefan priebe to
>> ignore flushes (as the crucial m550 has supercaps):
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035707.html
>> Coming from zfs, this sounds like "zfs_nocacheflush"
>> 
>> Now results: 
>> 
>> crucial m550 
>> ------------ 
>> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2
>> --group_reporting --invalidate=0 --name=ab --sync=1
>> bw=177575KB/s, iops=44393
>> 
>> 
>> 
>> fio rbd crucial m550 1 osd 0.85 (osd_enable_op_tracker true or false, same 
>> result): 
>> --------------------------- 
>> bw=12327KB/s, iops=3081 
>> 
>> So not much better than before, but this time iostat shows only 15%
>> util, and latencies are lower
>> 
>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdb 0.00 29.00 0.00 3075.00 0.00 36748.50 23.90 0.29 0.10 0.00 0.10 0.05 15.20
>> 
>> 
>> So, the write bottleneck seems to be in ceph.
>> 
>> 
>> 
>> I will send the s3500 results today
>> 
>> ----- Original Message -----
>>
>> From: "Cedric Lemarchand" <ced...@yipikai.org>
>> To: ceph-users@lists.ceph.com
>> Sent: Thursday, September 11, 2014 21:23:23
>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over
>> 3.2K IOPS
>> 
>> 
>> On 11/09/2014 19:33, Cedric Lemarchand wrote:
>>> On 11/09/2014 08:20, Alexandre DERUMIER wrote:
>>>> Hi Sebastien, 
>>>> 
>>>> here are my first results with the crucial m550 (I'll send results with
>>>> the intel s3500 later):
>>>> 
>>>> - 3 nodes 
>>>> - dell r620 without expander backplane 
>>>> - sas controller : lsi LSI 9207 (no hardware raid or cache) 
>>>> - 2 x E5-2603v2 1.8GHz (4 cores)
>>>> - 32GB ram 
>>>> - network : 2xgigabit link lacp + 2xgigabit lacp for cluster replication. 
>>>> 
>>>> - os : debian wheezy, with kernel 3.10
>>>>
>>>> os + ceph mon : 2x intel s3500 100gb (linux soft raid)
>>>> osd : crucial m550 (1TB)
>>>>
>>>>
>>>> 3 mons in the ceph cluster,
>>>> and 1 osd (journal and data on the same disk)
>>>> 
>>>> 
>>>> ceph.conf 
>>>> --------- 
>>>> debug_lockdep = 0/0 
>>>> debug_context = 0/0 
>>>> debug_crush = 0/0 
>>>> debug_buffer = 0/0 
>>>> debug_timer = 0/0 
>>>> debug_filer = 0/0 
>>>> debug_objecter = 0/0 
>>>> debug_rados = 0/0 
>>>> debug_rbd = 0/0 
>>>> debug_journaler = 0/0 
>>>> debug_objectcacher = 0/0
>>>> debug_client = 0/0 
>>>> debug_osd = 0/0 
>>>> debug_optracker = 0/0 
>>>> debug_objclass = 0/0 
>>>> debug_filestore = 0/0 
>>>> debug_journal = 0/0 
>>>> debug_ms = 0/0 
>>>> debug_monc = 0/0 
>>>> debug_tp = 0/0 
>>>> debug_auth = 0/0 
>>>> debug_finisher = 0/0 
>>>> debug_heartbeatmap = 0/0 
>>>> debug_perfcounter = 0/0 
>>>> debug_asok = 0/0 
>>>> debug_throttle = 0/0 
>>>> debug_mon = 0/0 
>>>> debug_paxos = 0/0 
>>>> debug_rgw = 0/0 
>>>> osd_op_threads = 5 
>>>> filestore_op_threads = 4 
>>>> 
>>>> ms_nocrc = true 
>>>> cephx sign messages = false 
>>>> cephx require signatures = false 
>>>> 
>>>> ms_dispatch_throttle_bytes = 0 
>>>> 
>>>> #0.85 
>>>> throttler_perf_counter = false 
>>>> filestore_fd_cache_size = 64 
>>>> filestore_fd_cache_shards = 32 
>>>> osd_op_num_threads_per_shard = 1 
>>>> osd_op_num_shards = 25 
>>>> osd_enable_op_tracker = true 
>>>> 
>>>> 
>>>> 
>>>> Fio disk 4K benchmark 
>>>> ------------------ 
>>>> rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread
>>>> --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc
>>>> --ioengine=aio
>>>> bw=271755KB/s, iops=67938
>>>>
>>>> rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite
>>>> --bs=4k --iodepth=32 --group_reporting --invalidate=0 --name=abc
>>>> --ioengine=aio
>>>> bw=228293KB/s, iops=57073
>>>> 
>>>> 
>>>> 
>>>> fio osd benchmark (through librbd) 
>>>> ---------------------------------- 
>>>> [global] 
>>>> ioengine=rbd 
>>>> clientname=admin 
>>>> pool=test 
>>>> rbdname=test 
>>>> invalidate=0 # mandatory 
>>>> rw=randwrite 
>>>> rw=randread 
>>>> bs=4k 
>>>> direct=1 
>>>> numjobs=4 
>>>> group_reporting=1 
>>>> 
>>>> [rbd_iodepth32] 
>>>> iodepth=32 
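For reference, the same job can be driven from the command line (a sketch: it needs fio built with rbd support and the test pool/image to exist; note that fio takes the last rw= line it sees, so the file as written would run randread, and the rw line was presumably switched between the randwrite and randread runs):

```shell
# Command-line form of the rbd job file above (sketch; requires fio
# built with rbd support and a reachable cluster with pool/image "test").
# Switch --rw between randwrite and randread for the two runs.
fio --ioengine=rbd --clientname=admin --pool=test --rbdname=test \
    --invalidate=0 --rw=randwrite --bs=4k --direct=1 --numjobs=4 \
    --group_reporting=1 --name=rbd_iodepth32 --iodepth=32
```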
>>>> 
>>>> 
>>>> 
>>>> FIREFLY RESULTS 
>>>> ---------------- 
>>>> fio randwrite : bw=5009.6KB/s, iops=1252 
>>>> 
>>>> fio randread: bw=37820KB/s, iops=9455 
>>>> 
>>>> 
>>>> 
>>>> 0.85 RESULTS
>>>> ------------ 
>>>> 
>>>> fio randwrite : bw=11658KB/s, iops=2914 
>>>> 
>>>> fio randread : bw=38642KB/s, iops=9660 
>>>> 
>>>> 
>>>> 
>>>> 0.85 + osd_enable_op_tracker=false 
>>>> ----------------------------------- 
>>>> fio randwrite : bw=11630KB/s, iops=2907
>>>> fio randread : bw=80606KB/s, iops=20151 (cpu 100% - GREAT!)
>>>> 
>>>> 
>>>> 
>>>> So, for reads, it seems that osd_enable_op_tracker is the bottleneck.
>>>> 
>>>> 
>>>> Now for writes, I really don't understand why performance is so low.
>>>> 
>>>> 
>>>> I have done some iostat: 
>>>> 
>>>> 
>>>> FIO directly on /dev/sdb 
>>>> bw=228293KB/s, iops=57073 
>>>> 
>>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>> sdb 0.00 0.00 0.00 63613.00 0.00 254452.00 8.00 31.24 0.49 0.00 0.49 0.02 100.00
>>>> 
>>>> 
>>>> FIO directly on osd through librbd 
>>>> bw=11658KB/s, iops=2914 
>>>> 
>>>> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>> sdb 0.00 355.00 0.00 5225.00 0.00 29678.00 11.36 57.63 11.03 0.00 11.03 0.19 99.70
>>>> 
>>>> 
>>>> (I don't understand exactly what %util is: it's ~100% in both cases,
>>>> even though ceph is 10x slower)
>>> It would be interesting if you could capture the size of writes on the SSD
>>> during the bench through librbd (I know nmon can do that)
>> Replying to myself... I asked a bit quickly, since we already have this
>> information (29678 / 5225 = 5.68KB), but this is irrelevant.
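That figure comes straight out of the iostat sample above (average write size = wkB/s divided by w/s):

```shell
# Average write size during the librbd bench, from the iostat columns above:
# wkB/s = 29678, w/s = 5225.
awk 'BEGIN { printf "avg write size: %.2f KB\n", 29678 / 5225 }'
# prints: avg write size: 5.68 KB
```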
>> 
>> Cheers 
>> 
>>>> It could be a dsync problem; the results seem pretty poor
>>>> 
>>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
>>>> 65536+0 records in
>>>> 65536+0 records out
>>>> 268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s
>>>>
>>>>
>>>> # dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
>>>> ^C17228+0 records in
>>>> 17228+0 records out
>>>> 70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
>>>> 
>>>> 
>>>> 
>>>> I'll do tests with intel s3500 tomorrow to compare 
>>>> 
>>>> ----- Original Message -----
>>>>
>>>> From: "Sebastien Han" <sebastien....@enovance.com>
>>>> To: "Warren Wang" <warren_w...@cable.comcast.com>
>>>> Cc: ceph-users@lists.ceph.com
>>>> Sent: Monday, September 8, 2014 22:58:25
>>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go
>>>> over 3.2K IOPS
>>>> 
>>>> They definitely are Warren! 
>>>> 
>>>> Thanks for bringing this here :). 
>>>> 
>>>> On 05 Sep 2014, at 23:02, Wang, Warren <warren_w...@cable.comcast.com> 
>>>> wrote: 
>>>> 
>>>>> +1 to what Cedric said. 
>>>>> 
>>>>> Anything more than a few minutes of heavy sustained writes tended to get 
>>>>> our solid state devices into a state where garbage collection could not 
>>>>> keep up. Originally we used small SSDs and did not overprovision the 
>>>>> journals by much. Manufacturers publish their SSD stats, and then in very 
>>>>> small font, state that the attained IOPS are with empty drives, and the 
>>>>> tests are only run for very short amounts of time. Even if the drives are 
>>>>> new, it's a good idea to perform an hdparm secure erase on them (so that 
>>>>> the SSD knows that the blocks are truly unused), and then overprovision 
>>>>> them. You'll know if you have a problem by watching for utilization and 
>>>>> wait data on the journals. 
>>>>> 
>>>>> One of the other interesting performance issues is that the Intel 10GbE
>>>>> NICs + default kernel that we typically use max out around 1 million
>>>>> packets/sec. It's worth tracking this metric to see if you are close.
>>>>> 
>>>>> I know these aren't necessarily relevant to the test parameters you gave 
>>>>> below, but they're worth keeping in mind. 
>>>>> 
>>>>> -- 
>>>>> Warren Wang 
>>>>> Comcast Cloud (OpenStack) 
>>>>> 
>>>>> 
>>>>> From: Cedric Lemarchand <ced...@yipikai.org> 
>>>>> Date: Wednesday, September 3, 2014 at 5:14 PM 
>>>>> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com> 
>>>>> Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go 
>>>>> over 3.2K IOPS
>>>>> 
>>>>> 
>>>>>> On 03/09/2014 22:11, Sebastien Han wrote:
>>>>>> Hi Warren, 
>>>>>> 
>>>>>> What do you mean exactly by secure erase? At the firmware level, with the
>>>>>> manufacturer's software?
>>>>>> The SSDs were pretty new, so I don't think we hit that sort of thing. I
>>>>>> believe that only aged SSDs have this behaviour, but I might be wrong.
>>>>>> 
>>>>> Sorry, I forgot to reply to the real question ;-) So yes, it only
>>>>> comes into play after some time; in your case, if the SSD still delivers
>>>>> the write IOPS specified by the manufacturer, it won't help in any way.
>>>>>
>>>>> But it seems this practice is increasingly common nowadays.
>>>>> 
>>>>> Cheers 
>>>>>> On 02 Sep 2014, at 18:23, Wang, Warren 
>>>>>> <warren_w...@cable.comcast.com> 
>>>>>> wrote: 
>>>>>> 
>>>>>> 
>>>>>>> Hi Sebastien, 
>>>>>>> 
>>>>>>> Something I didn't see in the thread so far: did you secure erase the
>>>>>>> SSDs before they got used? I assume these were probably repurposed for
>>>>>>> this test. We have seen some pretty significant garbage collection
>>>>>>> issues on various SSDs and other forms of solid state storage, to the
>>>>>>> point where we are overprovisioning pretty much every solid state
>>>>>>> device now, by as much as 50%, to handle sustained write operations.
>>>>>>> Especially important for the journals, as we've found.
>>>>>>> 
>>>>>>> Maybe not an issue on the short fio run below, but certainly evident on 
>>>>>>> longer runs or lots of historical data on the drives. The max 
>>>>>>> transaction time looks pretty good for your test. Something to consider 
>>>>>>> though. 
>>>>>>> 
>>>>>>> Warren 
>>>>>>> 
>>>>>>> -----Original Message----- 
>>>>>>> From: ceph-users [ 
>>>>>>> mailto:ceph-users-boun...@lists.ceph.com 
>>>>>>> ] On Behalf Of Sebastien Han 
>>>>>>> Sent: Thursday, August 28, 2014 12:12 PM 
>>>>>>> To: ceph-users 
>>>>>>> Cc: Mark Nelson 
>>>>>>> Subject: [ceph-users] [Single OSD performance on SSD] Can't go 
>>>>>>> over 3.2K IOPS
>>>>>>> 
>>>>>>> Hey all, 
>>>>>>> 
>>>>>>> It has been a while since the last performance-related thread on the ML
>>>>>>> :p I've been running some experiments to see how much I can get from an
>>>>>>> SSD in a Ceph cluster.
>>>>>>> To achieve that I did something pretty simple: 
>>>>>>> 
>>>>>>> * Debian wheezy 7.6 
>>>>>>> * kernel from debian 3.14-0.bpo.2-amd64 
>>>>>>> * 1 cluster, 3 mons (i'd like to keep this realistic since in a 
>>>>>>> real deployment i'll use 3) 
>>>>>>> * 1 OSD backed by an SSD (journal and osd data on the same 
>>>>>>> device) 
>>>>>>> * replica count of 1
>>>>>>> * partitions are perfectly aligned 
>>>>>>> * io scheduler is set to noop, but deadline was showing the same
>>>>>>> results 
>>>>>>> * no updatedb running 
>>>>>>> 
>>>>>>> About the box: 
>>>>>>> 
>>>>>>> * 32GB of RAM 
>>>>>>> * 12 cores with HT @ 2.4 GHz
>>>>>>> * WB cache is enabled on the controller 
>>>>>>> * 10Gbps network (doesn't help here) 
>>>>>>> 
>>>>>>> The SSD is a 200GB Intel DC S3700 and is capable of delivering around
>>>>>>> 29K iops with random 4k writes (my fio results). As a benchmark tool I
>>>>>>> used fio with the rbd engine (thanks Deutsche Telekom guys!).
>>>>>>> 
>>>>>>> O_DIRECT and O_DSYNC don't seem to be a problem for the SSD:
>>>>>>> 
>>>>>>> # dd if=/dev/urandom of=rand.file bs=4k count=65536 
>>>>>>> 65536+0 records in 
>>>>>>> 65536+0 records out 
>>>>>>> 268435456 bytes (268 MB) copied, 29.5477 s, 9.1 MB/s 
>>>>>>> 
>>>>>>> # du -sh rand.file 
>>>>>>> 256M rand.file 
>>>>>>> 
>>>>>>> # dd if=rand.file of=/dev/sdo bs=4k count=65536 
>>>>>>> oflag=dsync,direct 
>>>>>>> 65536+0 records in 
>>>>>>> 65536+0 records out 
>>>>>>> 268435456 bytes (268 MB) copied, 2.73628 s, 98.1 MB/s 
>>>>>>> 
>>>>>>> See my ceph.conf: 
>>>>>>> 
>>>>>>> [global] 
>>>>>>> auth cluster required = cephx 
>>>>>>> auth service required = cephx 
>>>>>>> auth client required = cephx 
>>>>>>> fsid = 857b8609-8c9b-499e-9161-2ea67ba51c97 
>>>>>>> osd pool default pg num = 4096 
>>>>>>> osd pool default pgp num = 4096 
>>>>>>> osd pool default size = 2 
>>>>>>> osd crush chooseleaf type = 0 
>>>>>>> 
>>>>>>> debug lockdep = 0/0 
>>>>>>> debug context = 0/0 
>>>>>>> debug crush = 0/0 
>>>>>>> debug buffer = 0/0 
>>>>>>> debug timer = 0/0 
>>>>>>> debug journaler = 0/0 
>>>>>>> debug osd = 0/0 
>>>>>>> debug optracker = 0/0 
>>>>>>> debug objclass = 0/0 
>>>>>>> debug filestore = 0/0 
>>>>>>> debug journal = 0/0 
>>>>>>> debug ms = 0/0 
>>>>>>> debug monc = 0/0 
>>>>>>> debug tp = 0/0 
>>>>>>> debug auth = 0/0 
>>>>>>> debug finisher = 0/0 
>>>>>>> debug heartbeatmap = 0/0 
>>>>>>> debug perfcounter = 0/0 
>>>>>>> debug asok = 0/0 
>>>>>>> debug throttle = 0/0 
>>>>>>> 
>>>>>>> [mon] 
>>>>>>> mon osd down out interval = 600 
>>>>>>> mon osd min down reporters = 13 
>>>>>>> [mon.ceph-01] 
>>>>>>> host = ceph-01 
>>>>>>> mon addr = 172.20.20.171 
>>>>>>> [mon.ceph-02] 
>>>>>>> host = ceph-02 
>>>>>>> mon addr = 172.20.20.172 
>>>>>>> [mon.ceph-03] 
>>>>>>> host = ceph-03 
>>>>>>> mon addr = 172.20.20.173 
>>>>>>> 
>>>>>>> debug lockdep = 0/0 
>>>>>>> debug context = 0/0 
>>>>>>> debug crush = 0/0 
>>>>>>> debug buffer = 0/0 
>>>>>>> debug timer = 0/0 
>>>>>>> debug journaler = 0/0 
>>>>>>> debug osd = 0/0 
>>>>>>> debug optracker = 0/0 
>>>>>>> debug objclass = 0/0 
>>>>>>> debug filestore = 0/0 
>>>>>>> debug journal = 0/0 
>>>>>>> debug ms = 0/0 
>>>>>>> debug monc = 0/0 
>>>>>>> debug tp = 0/0 
>>>>>>> debug auth = 0/0 
>>>>>>> debug finisher = 0/0 
>>>>>>> debug heartbeatmap = 0/0 
>>>>>>> debug perfcounter = 0/0 
>>>>>>> debug asok = 0/0 
>>>>>>> debug throttle = 0/0 
>>>>>>> 
>>>>>>> [osd]
>>>>>>> osd mkfs type = xfs
>>>>>>> osd mkfs options xfs = -f -i size=2048
>>>>>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>>>>>> osd journal size = 20480
>>>>>>> cluster_network = 172.20.20.0/24
>>>>>>> public_network = 172.20.20.0/24
>>>>>>> osd mon heartbeat interval = 30
>>>>>>> # Performance tuning
>>>>>>> filestore merge threshold = 40
>>>>>>> filestore split multiple = 8
>>>>>>> osd op threads = 8
>>>>>>> # Recovery tuning
>>>>>>> osd recovery max active = 1
>>>>>>> osd max backfills = 1
>>>>>>> osd recovery op priority = 1
>>>>>>> 
>>>>>>> 
>>>>>>> debug lockdep = 0/0 
>>>>>>> debug context = 0/0 
>>>>>>> debug crush = 0/0 
>>>>>>> debug buffer = 0/0 
>>>>>>> debug timer = 0/0 
>>>>>>> debug journaler = 0/0 
>>>>>>> debug osd = 0/0 
>>>>>>> debug optracker = 0/0 
>>>>>>> debug objclass = 0/0 
>>>>>>> debug filestore = 0/0 
>>>>>>> debug journal = 0/0 
>>>>>>> debug ms = 0/0 
>>>>>>> debug monc = 0/0 
>>>>>>> debug tp = 0/0 
>>>>>>> debug auth = 0/0 
>>>>>>> debug finisher = 0/0 
>>>>>>> debug heartbeatmap = 0/0 
>>>>>>> debug perfcounter = 0/0 
>>>>>>> debug asok = 0/0 
>>>>>>> debug throttle = 0/0 
>>>>>>> 
>>>>>>> Disabling all debugging gained me another 200-300 IOPS.
>>>>>>> 
>>>>>>> See my fio template: 
>>>>>>> 
>>>>>>> [global] 
>>>>>>> #logging 
>>>>>>> #write_iops_log=write_iops_log 
>>>>>>> #write_bw_log=write_bw_log 
>>>>>>> #write_lat_log=write_lat_lo 
>>>>>>> 
>>>>>>> time_based 
>>>>>>> runtime=60 
>>>>>>> 
>>>>>>> ioengine=rbd 
>>>>>>> clientname=admin 
>>>>>>> pool=test 
>>>>>>> rbdname=fio 
>>>>>>> invalidate=0 # mandatory 
>>>>>>> #rw=randwrite 
>>>>>>> rw=write 
>>>>>>> bs=4k 
>>>>>>> #bs=32m 
>>>>>>> size=5G 
>>>>>>> group_reporting 
>>>>>>> 
>>>>>>> [rbd_iodepth32] 
>>>>>>> iodepth=32 
>>>>>>> direct=1 
>>>>>>> 
>>>>>>> See my fio output:
>>>>>>> 
>>>>>>> rbd_iodepth32: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
>>>>>>> fio-2.1.11-14-gb74e
>>>>>>> Starting 1 process
>>>>>>> rbd engine: RBD version: 0.1.8
>>>>>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/12876KB/0KB /s] [0/3219/0 iops] [eta 00m:00s]
>>>>>>> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=32116: Thu Aug 28 00:28:26 2014
>>>>>>>   write: io=771448KB, bw=12855KB/s, iops=3213, runt= 60010msec
>>>>>>>     slat (usec): min=42, max=1578, avg=66.50, stdev=16.96
>>>>>>>     clat (msec): min=1, max=28, avg= 9.85, stdev= 1.48
>>>>>>>      lat (msec): min=1, max=28, avg= 9.92, stdev= 1.47
>>>>>>>     clat percentiles (usec):
>>>>>>>      |  1.00th=[ 6368],  5.00th=[ 8256], 10.00th=[ 8640], 20.00th=[ 9152],
>>>>>>>      | 30.00th=[ 9408], 40.00th=[ 9664], 50.00th=[ 9792], 60.00th=[10048],
>>>>>>>      | 70.00th=[10176], 80.00th=[10560], 90.00th=[10944], 95.00th=[11456],
>>>>>>>      | 99.00th=[13120], 99.50th=[16768], 99.90th=[25984], 99.95th=[27008],
>>>>>>>      | 99.99th=[28032]
>>>>>>>     bw (KB /s): min=11864, max=13808, per=100.00%, avg=12864.36, stdev=407.35
>>>>>>>     lat (msec) : 2=0.03%, 4=0.54%, 10=59.79%, 20=39.24%, 50=0.41%
>>>>>>>   cpu : usr=19.15%, sys=4.69%, ctx=326309, majf=0, minf=426088
>>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=33.9%, 32=66.1%, >=64=0.0%
>>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>>      complete : 0=0.0%, 4=99.6%, 8=0.4%, 16=0.1%, 32=0.1%, 64=0.0%, >=64=0.0%
>>>>>>>      issued : total=r=0/w=192862/d=0, short=r=0/w=0/d=0
>>>>>>>      latency : target=0, window=0, percentile=100.00%, depth=32
>>>>>>> 
>>>>>>> Run status group 0 (all jobs): 
>>>>>>> WRITE: io=771448KB, aggrb=12855KB/s, minb=12855KB/s, 
>>>>>>> maxb=12855KB/s, mint=60010msec, maxt=60010msec 
>>>>>>> 
>>>>>>> Disk stats (read/write): 
>>>>>>> dm-1: ios=0/49, merge=0/0, ticks=0/12, in_queue=12, util=0.01%, 
>>>>>>> aggrios=0/22, aggrmerge=0/27, aggrticks=0/12, aggrin_queue=12, 
>>>>>>> aggrutil=0.01% 
>>>>>>> sda: ios=0/22, merge=0/27, ticks=0/12, in_queue=12, util=0.01% 
>>>>>>> 
>>>>>>> I tried to tweak several parameters like: 
>>>>>>> 
>>>>>>> filestore_wbthrottle_xfs_ios_start_flusher = 10000
>>>>>>> filestore_wbthrottle_xfs_ios_hard_limit = 10000
>>>>>>> filestore_wbthrottle_btrfs_ios_start_flusher = 10000
>>>>>>> filestore_wbthrottle_btrfs_ios_hard_limit = 10000
>>>>>>> filestore queue max ops = 2000
>>>>>>>
>>>>>>> But didn't see any improvement.
>>>>>>> 
>>>>>>> Then I tried other things: 
>>>>>>> 
>>>>>>> * Increasing the io_depth up to 256 or 512 gave me between 50 and 100
>>>>>>> more IOPS, but it's not a realistic workload anymore and not that
>>>>>>> significant.
>>>>>>> * adding another SSD for the journal: still getting 3.2K IOPS
>>>>>>> * I tried with rbd bench and I also got 3K IOPS
>>>>>>> * I ran the test on a client machine and then locally on the
>>>>>>> server: still getting 3.2K IOPS
>>>>>>> * put the journal in memory: still getting 3.2K IOPS
>>>>>>> * with 2 clients running the test in parallel I got a total of
>>>>>>> 3.6K IOPS, but I don't seem to be able to go over that
>>>>>>> * I tried adding another OSD to that SSD, so I had 2 OSDs and 2
>>>>>>> journals on 1 SSD, and got 4.5K IOPS, yay!
>>>>>>> 
>>>>>>> Given the results of the last time it seems that something is limiting 
>>>>>>> the number of IOPS per OSD process. 
>>>>>>> 
>>>>>>> Running the test on a client or locally didn't show any difference. 
>>>>>>> So it looks to me that there is some contention within Ceph that might 
>>>>>>> cause this. 
>>>>>>> 
>>>>>>> I also ran perf and looked at the output, everything looks decent, but 
>>>>>>> someone might want to have a look at it :). 
>>>>>>> 
>>>>>>> We have been able to reproduce this on 3 distinct platforms with some 
>>>>>>> deviations (because of the hardware) but the behaviour is the same. 
>>>>>>> Any thoughts will be highly appreciated; only getting 3.2K out of a
>>>>>>> 29K IOPS SSD is a bit frustrating :).
>>>>>>> 
>>>>>>> Cheers. 
>>>>>>> ---- 
>>>>>>> Sébastien Han 
>>>>>>> Cloud Architect 
>>>>>>> 
>>>>>>> "Always give 100%. Unless you're giving blood." 
>>>>>>> 
>>>>>>> Phone: +33 (0)1 49 70 99 72 
>>>>>>> Mail: 
>>>>>>> sebastien....@enovance.com 
>>>>>>> 
>>>>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web : 
>>>>>>> www.enovance.com 
>>>>>>> - Twitter : @enovance 
>>>>>>> 
>>>>>>> 
>>>>>> Cheers. 
>>>>>> ---- 
>>>>>> Sébastien Han 
>>>>>> Cloud Architect 
>>>>>> 
>>>>>> "Always give 100%. Unless you're giving blood." 
>>>>>> 
>>>>>> Phone: +33 (0)1 49 70 99 72 
>>>>>> Mail: 
>>>>>> sebastien....@enovance.com 
>>>>>> 
>>>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web : 
>>>>>> www.enovance.com 
>>>>>> - Twitter : @enovance 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> ceph-us...@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> -- 
>>>>> Cédric 
>>>>> 
>>>> Cheers. 
>>>> ---- 
>>>> Sébastien Han 
>>>> Cloud Architect 
>>>> 
>>>> "Always give 100%. Unless you're giving blood." 
>>>> 
>>>> Phone: +33 (0)1 49 70 99 72 
>>>> Mail: sebastien....@enovance.com 
>>>> Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com 
>>>> - Twitter : @enovance 
>>>> 
>>>> 
>> 
>> -- 
>> Cédric 
>> 
> 
> 
> Cheers. 
> ---- 
> Sébastien Han 
> Cloud Architect 
> 
> "Always give 100%. Unless you're giving blood." 
> 
> Phone: +33 (0)1 49 70 99 72 
> Mail: sebastien....@enovance.com 
> Address : 11 bis, rue Roquépine - 75008 Paris Web : www.enovance.com - 
> Twitter : @enovance 
> 


Cheers. 
----
Sébastien Han 
Cloud Architect 

"Always give 100%. Unless you're giving blood." 

Phone: +33 (0)1 49 70 99 72 
Mail: sebastien....@enovance.com 
Address : 11 bis, rue Roquépine - 75008 Paris 
Web : www.enovance.com - Twitter : @enovance 
