>> That’s very strange. Is nothing else using the disks?

No, only the fio benchmark.

>> The difference between noop and cfq should be (and in my experience is)
>> marginal for such a benchmark.

Maybe a bug in cfq (kernel 3.16, Debian Jessie)? The deadline scheduler also gives me the same performance as noop.
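For reference, the scheduler in use can be checked and switched per device through sysfs (sdb here is just a placeholder for the device under test):

#cat /sys/block/sdb/queue/scheduler
#echo noop > /sys/block/sdb/queue/scheduler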
----- Original Message -----
From: "Jan Schermer" <j...@schermer.cz>
To: "aderumier" <aderum...@odiso.com>
Cc: "Somnath Roy" <somnath....@sandisk.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Thursday, 9 July 2015 18:20:51
Subject: Re: [ceph-users] Investigating my 100 IOPS limit

That’s very strange. Is nothing else using the disks?

The difference between noop and cfq should be (and in my experience is) marginal for such a benchmark.

Jan

> On 09 Jul 2015, at 18:11, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>
> Hi again,
>
> I totally forgot to check the io scheduler in my last tests; that was with cfq.
>
> With the noop scheduler, I see a huge difference.
>
> cfq:
>
> - sequential synchronous 4k write, iodepth=1 : 60 iops
> - sequential synchronous 4k write, iodepth=32 : 2000 iops
>
> noop:
>
> - sequential synchronous 4k write, iodepth=1 : 7866 iops
> - sequential synchronous 4k write, iodepth=32 : 34303 iops
>
>
> ----- Original Message -----
> From: "Somnath Roy" <somnath....@sandisk.com>
> To: "Jan Schermer" <j...@schermer.cz>, "aderumier" <aderum...@odiso.com>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Thursday, 9 July 2015 17:46:41
> Subject: RE: [ceph-users] Investigating my 100 IOPS limit
>
> I am not sure how increasing iodepth for sync writes is giving you a better
> result; the sync fio engine is supposed to always use iodepth=1.
> BTW, I faced similar issues some time back. Running the following fio job
> file, I was getting very dismal performance on my SSD on top of XFS:
>
> [random-write]
> directory=/mnt/fio_test
> rw=randwrite
> bs=16k
> direct=1
> sync=1
> time_based
> runtime=1m
> size=700G
> group_reporting
>
> Result:
> --------
> IOPS = 420
>
> lat (usec) : 250=0.10%, 500=2.28%, 750=22.25%, 1000=0.01%
> lat (msec) : 2=20.05%, 4=46.64%, 10=8.68%
>
>
> It turned out to be an SSD firmware problem. Some SSDs tend to misbehave in this
> pattern (even directly on the block device, without any XFS) because they don't
> handle O_DIRECT|O_SYNC writes well. I am sure you will find some references by
> digging into the Ceph mailing list. That's why not all SSDs behave well as a
> Ceph journal.
>
> Thanks & Regards
> Somnath
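As an aside, if the firmware angle needs chasing, the drive model and firmware revision can be read with smartctl from smartmontools (sdb again being just a placeholder for the device):

#smartctl -i /dev/sdb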
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
> Sent: Thursday, July 09, 2015 8:24 AM
> To: Alexandre DERUMIER
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>
> Those are very strange numbers. Is the “60” figure right?
>
> Can you paste the full fio command and output?
> Thanks
>
> Jan
>
>> On 09 Jul 2015, at 15:58, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>>
>> I just tried on an Intel S3700, on top of XFS.
>>
>> fio, with
>> - sequential synchronous 4k write, iodepth=1 : 60 iops
>> - sequential synchronous 4k write, iodepth=32 : 2000 iops
>> - random synchronous 4k write, iodepth=1 : 8000 iops
>> - random synchronous 4k write, iodepth=32 : 18000 iops
>>
>>
>> ----- Original Message -----
>> From: "aderumier" <aderum...@odiso.com>
>> To: "Jan Schermer" <j...@schermer.cz>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Thursday, 9 July 2015 15:50:35
>> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>>
>>>> Any ideas where to look? I was hoping blktrace would show what
>>>> exactly is going on, but it just shows a synchronous write -> (10ms)
>>>> -> completed
>>
>> What size is the write in this case? 4K, or more?
>>
>>
>> ----- Original Message -----
>> From: "Jan Schermer" <j...@schermer.cz>
>> To: "aderumier" <aderum...@odiso.com>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Thursday, 9 July 2015 15:29:15
>> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>>
>> I tried everything: --write-barrier, --sync, --fsync, --fdatasync. I never
>> get the same 10ms latency. It must be something special that the filesystem
>> journal/log does.
>>
>> Any ideas where to look? I was hoping blktrace would show what exactly
>> is going on, but it just shows a synchronous write -> (10ms) ->
>> completed
>>
>> Jan
>>
>>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>>>
>>>>> I have 12K IOPS in this test on the block device itself. But only
>>>>> 100 filesystem transactions (=IOPS) on a filesystem on the same
>>>>> device, because the “flush” (=FUA?) operation takes 10ms to finish.
>>>>> I just can’t replicate the same “flush” operation with fio on the
>>>>> block device, unfortunately, so I have no idea what is causing that
>>>>> :/
>>>
>>> AFAIK, fio on a block device with --sync=1 does a flush after each write.
>>>
>>> I'm not sure about fio on a filesystem, but the filesystem should do an
>>> fsync after each file write.
>>>
>>>
>>> ----- Original Message -----
>>> From: "Jan Schermer" <j...@schermer.cz>
>>> To: "aderumier" <aderum...@odiso.com>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>
>>> Sent: Thursday, 9 July 2015 14:43:46
>>> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>>>
>>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5
>>> and higher have it for sure.
>>>
>>> I have 12K IOPS in this test on the block device itself. But only 100
>>> filesystem transactions (=IOPS) on a filesystem on the same device,
>>> because the “flush” (=FUA?) operation takes 10ms to finish. I just
>>> can’t replicate the same “flush” operation with fio on the block
>>> device, unfortunately, so I have no idea what is causing that :/
>>>
>>> Jan
>>>
>>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>>>>
>>>> Hi,
>>>> I have already seen bad performance with a Crucial m550 SSD: 400 iops
>>>> for synchronous writes.
>>>>
>>>> Which model of SSD do you have?
>>>>
>>>> See this:
>>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>
>>>> What result do you get directly on the disk with
>>>>
>>>> #dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
>>>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
>>>> --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
>>>> --name=journal-test
>>>>
>>>> ?
>>>>
>>>> I'm using LSI 3008 controllers with Intel SSDs (3500, 3610, 3700) in
>>>> passthrough mode, and don't have any problem.
>>>>
>>>>
>>>> Also, about the CentOS 2.6.32 kernel: I'm not sure FUA support has been
>>>> backported by Red Hat (true FUA support arrived in 2.6.37), so maybe
>>>> it's the old barrier code.
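On the flush/FUA point, it can also be worth confirming whether the drive's volatile write cache is even enabled; on a stock setup that should be visible with something like the following (sdb being a placeholder for the device, and hdparm applying only to ATA/SATA drives):

#cat /sys/block/sdb/device/scsi_disk/*/cache_type
#hdparm -W /dev/sdb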
>>>> ----- Original Message -----
>>>> From: "Jan Schermer" <j...@schermer.cz>
>>>> To: "ceph-users" <ceph-users@lists.ceph.com>
>>>> Sent: Thursday, 9 July 2015 12:32:04
>>>> Subject: [ceph-users] Investigating my 100 IOPS limit
>>>>
>>>> I hope this will be interesting for some; it nearly cost me my sanity.
>>>>
>>>> Some time ago I came here with a problem manifesting as a “100 IOPS*”
>>>> limit with the LSI controllers and some drives.
>>>> It almost drove me crazy, because I could replicate the problem with ease,
>>>> but when I wanted to show it to someone it was often gone. Sometimes
>>>> it required fio to write for some time before the problem manifested
>>>> again, or seemingly conflicting settings to make it come up…
>>>>
>>>> Well, it turns out the problem is fio calling fallocate() when creating the
>>>> file to use for the test, which doesn’t really allocate the blocks, it
>>>> just “reserves” them.
>>>> When fio writes to those blocks, the filesystem journal becomes the
>>>> bottleneck (the 100 IOPS* limit can be seen there, with 100% utilization).
>>>>
>>>> If, however, I create the file with dd or such, those writes do _not_ end up
>>>> in the journal, and the result is 10K synchronous 4K IOPS on the same
>>>> drive.
>>>> If, for example, I run fio with a 1M block size, it still does 100*
>>>> IOPS, and when I then run a 4K block size test without deleting the file,
>>>> it runs at a 10K IOPS pace until it hits the first unwritten blocks;
>>>> then it slows to a crawl again.
>>>>
>>>> The same issue is present with XFS and ext3/ext4 (with default mount
>>>> options), and no matter how I create or mount the filesystem I cannot
>>>> avoid it. The only way around it is to mount ext4 with
>>>> -o journal_async_commit, which should be safe, but...
>>>>
>>>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), with LSI HBAs
>>>> and Kingston SSDs in this case (interestingly, this issue does not seem to
>>>> occur on Samsung SSDs!). I think it has something to do with LSI faking
>>>> “FUA” support for the drives (AFAIK they don’t support it, so the
>>>> controller must somehow flush the cache, which is what introduces the huge
>>>> latency hit).
>>>> I can’t replicate this problem on the block device itself, only on a file
>>>> on a filesystem, so it might as well be a kernel/driver bug. I have a
>>>> blktrace showing the difference between the “good” and “bad” writes, but I
>>>> don’t know what the driver/controller does; I only see the write on the
>>>> log device finishing after a long 10ms.
>>>>
>>>> Could someone tell me how Ceph creates its filesystem objects? I suppose
>>>> it does fallocate() as well, right? Is there any way to force it to write
>>>> them out completely instead, to get around this issue?
>>>>
>>>> How to replicate:
>>>>
>>>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write
>>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting
>>>> --name=journal-test --size=1000M --ioengine=libaio
>>>>
>>>>
>>>> * It is in fact 98 IOPS. Exactly. Not more, not less :-)
>>>>
>>>> Jan
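For the record, pre-creating the test file with dd (the workaround mentioned above) before running the same fio job would look something like this, with the path and size taken from the reproducer command:

#dd if=/dev/zero of=/mnt/something/testfile.fio bs=1M count=1000 oflag=direct

fio also has a fallocate= option controlling this pre-allocation step (fallocate=none disables the fallocate() call), which may be worth experimenting with as well.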
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com