Re: [ceph-users] Investigating my 100 IOPS limit

Jan Schermer Thu, 09 Jul 2015 06:29:38 -0700

I tried everything: --write-barrier, --sync, --fsync, --fdatasync.
I never get the same 10ms latency. Must be something the filesystem journal/log 
does that is special.
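
For reference, the link between the 10ms latency and the ~100 IOPS ceiling is simple arithmetic at queue depth 1: each write has to complete before the next one is issued. A small sketch (the function names are just illustrative, not from any tool):

```python
def iops_from_latency(latency_ms: float, iodepth: int = 1) -> float:
    """Upper bound on IOPS when each in-flight write takes latency_ms to complete."""
    return iodepth * 1000.0 / latency_ms


def latency_from_iops(iops: float, iodepth: int = 1) -> float:
    """Per-write latency in milliseconds implied by an observed IOPS figure."""
    return iodepth * 1000.0 / iops


print(iops_from_latency(10.0))           # 100.0
print(round(latency_from_iops(98), 2))   # 10.2
```

So the 98 IOPS figure from the original post corresponds to roughly 10.2ms per synchronous write.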


Any ideas where to look? I was hoping blktrace would show what exactly is going 
on, but it just shows a synchronous write -> (10ms) -> completed
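
In case someone wants to compare the two cases without fio, here is a minimal sketch that times synchronous 4K writes into a preallocated file versus a fully written one. It assumes Linux and Python 3, and that opening with O_SYNC is close enough to what fio does with --sync=1; the helper names (make_file, sync_write_latencies, compare) are made up for this example, and whether the difference shows up will depend on the filesystem, controller and drive.

```python
import os
import tempfile
import time

BLOCK = 4096  # 4K writes, as in the fio test


def make_file(path, size, prewrite):
    """Create a file of `size` bytes, either fully written or only preallocated."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
    try:
        if prewrite:
            # Write real data so every block is allocated and initialized (like dd).
            chunk = b"\0" * BLOCK
            for off in range(0, size, BLOCK):
                os.pwrite(fd, chunk, off)
        else:
            # Only reserve the blocks; they stay unwritten until first use.
            os.posix_fallocate(fd, 0, size)
        os.fsync(fd)
    finally:
        os.close(fd)


def sync_write_latencies(path, count):
    """Overwrite the first `count` blocks with O_SYNC; return per-write seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_SYNC)
    buf = os.urandom(BLOCK)
    latencies = []
    try:
        for i in range(count):
            start = time.perf_counter()
            os.pwrite(fd, buf, i * BLOCK)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def compare(directory, blocks=256):
    """Return the average sync-write latency (seconds) for both file types."""
    results = {}
    for name, prewrite in (("fallocated", False), ("prewritten", True)):
        path = os.path.join(directory, "testfile." + name)
        make_file(path, blocks * BLOCK, prewrite)
        lat = sync_write_latencies(path, blocks)
        results[name] = sum(lat) / len(lat)
        os.unlink(path)
    return results


if __name__ == "__main__":
    # Uses the default temp directory; pass a directory on the filesystem
    # under test to compare() to measure that device instead.
    with tempfile.TemporaryDirectory() as d:
        for name, avg in compare(d).items():
            print(f"{name}: {avg * 1000:.3f} ms avg, ~{1 / avg:.0f} IOPS")
```

If the journal is the bottleneck, the "fallocated" average should be noticeably higher than the "prewritten" one.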

Jan

> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <aderum...@odiso.com> wrote:
> 
>>> I have 12K IOPS in this test on the block device itself. But only 100 
>>> filesystem transactions (=IOPS) on filesystem on the same device because 
>>> the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate 
>>> the same “flush” operation with fio on the block device, unfortunately, 
>>> so I have no idea what is causing that :/ 
> 
> AFAIK, fio on a block device with --sync=1 does a flush after each 
> write.
> 
> I'm not sure about fio on a filesystem, but the filesystem should do an fsync 
> after each file write.
> 
> 
> ----- Original message -----
> From: "Jan Schermer" <j...@schermer.cz>
> To: "aderumier" <aderum...@odiso.com>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Thursday, 9 July 2015 14:43:46
> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
> 
> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and 
> higher have it for sure. 
> 
> I have 12K IOPS in this test on the block device itself. But only 100 
> filesystem transactions (=IOPS) on filesystem on the same device because the 
> “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the 
> same “flush” operation with fio on the block device, unfortunately, so I have 
> no idea what is causing that :/ 
> 
> Jan 
> 
>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <aderum...@odiso.com> wrote: 
>> 
>> Hi, 
>> I have already seen bad performance with the Crucial M550 SSD: 400 IOPS 
>> on synchronous writes. 
>> 
>> What model of SSD do you have? 
>> 
>> see this: 
>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>  
>> 
>> What results do you get on the disk directly with 
>> 
>> #dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync 
>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
>> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
>> 
>> ? 
>> 
>> I'm using LSI 3008 controllers with Intel SSDs (3500, 3610, 3700) in 
>> passthrough mode, and I don't have any problems. 
>> 
>> 
>> Also, regarding CentOS with 2.6.32: I'm not sure FUA support has been 
>> backported by Red Hat (true FUA support only arrived in 2.6.37), 
>> so maybe it's the old barrier code. 
>> 
>> 
>> ----- Original message ----- 
>> From: "Jan Schermer" <j...@schermer.cz> 
>> To: "ceph-users" <ceph-users@lists.ceph.com> 
>> Sent: Thursday, 9 July 2015 12:32:04 
>> Subject: [ceph-users] Investigating my 100 IOPS limit 
>> 
>> I hope this would be interesting for some, it nearly cost me my sanity. 
>> 
>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
>> with the LSI controllers and some drives. 
>> It almost drove me crazy as I could replicate the problem with ease but when 
>> I wanted to show it to someone it was often gone. Sometimes it required fio 
>> to write for some time for the problem to manifest again, required seemingly 
>> conflicting settings to come up… 
>> 
>> Well, turns out the problem is fio calling fallocate() when creating the 
>> file to use for this test, which doesn’t really allocate the blocks, it just 
>> “reserves” them. 
>> When fio writes to those blocks, the filesystem journal becomes the 
>> bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
>> 
>> If, however, I create the file with dd or such, those writes do _not_ end in 
>> the journal, and the result is 10K synchronous 4K IOPS on the same drive. 
>> If, for example, I run fio with a 1M block size, it would still do 100* IOPS 
>> and when I then run a 4K block size test without deleting the file, it would 
>> run at a 10K IOPS pace until it hits the first unwritten blocks - then it 
>> slows to a crawl again. 
>> 
>> The same issue is present with XFS and ext3/ext4 (with default mount 
>> options), and no matter how I create or mount the filesystem, I cannot 
>> avoid it. The only workaround I found is to mount ext4 with -o 
>> journal_async_commit, which should be safe, but... 
>> 
>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and 
>> Kingston SSDs in this case (interestingly, this issue does not seem to occur 
>> on Samsung SSDs!). I think it has something to do with LSI faking a “FUA” 
>> support for the drives (AFAIK they don’t support it so the controller must 
>> somehow flush the cache, which is what introduces a huge latency hit). 
>> I can’t replicate this problem on the block device itself, only on a file on 
>> filesystem, so it might as well be a kernel/driver bug. I have a blktrace 
>> showing the difference between the “good” and “bad” writes, but I don’t know 
>> what the driver/controller does - I only see the write on the log device 
>> finishing after a long 10ms. 
>> 
>> Could someone tell me how Ceph creates the filesystem objects? I suppose it 
>> does fallocate() as well, right? Is there any way to force it to write them 
>> out completely instead of using fallocate(), to get around this issue? 
>> 
>> How to replicate: 
>> 
>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k 
>> --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test 
>> --size=1000M --ioengine=libaio 
>> 
>> 
>> * It is in fact 98 IOPS. Exactly. Not more, not less :-) 
>> 
>> Jan 
>> _______________________________________________ 
>> ceph-users mailing list 
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

