>> That’s very strange. Is nothing else using the disks?

No, only the fio benchmark.

>> The difference between noop and cfq should be (and in my experience is)
>> marginal for such a benchmark.

Maybe a bug in cfq (kernel 3.16, Debian Jessie)? The deadline scheduler also gives me the same performance as noop.
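For reference, the scheduler in use can be checked and switched per device through sysfs (sdb here is just a placeholder for the device under test):

#cat /sys/block/sdb/queue/scheduler
#echo noop > /sys/block/sdb/queue/scheduler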
----- Original Message -----
From: "Jan Schermer" <j...@schermer.cz>
To: "aderumier" <aderum...@odiso.com>
Cc: "Somnath Roy" <somnath....@sandisk.com>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Thursday, 9 July 2015 18:20:51
Subject: Re: [ceph-users] Investigating my 100 IOPS limit

That’s very strange. Is nothing else using the disks?

The difference between noop and cfq should be (and in my experience is) marginal for such a benchmark.

Jan

> On 09 Jul 2015, at 18:11, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>
> Hi again,
>
> I totally forgot to check the io scheduler in my last tests; that was with cfq.
>
> With the noop scheduler, I see a huge difference.
>
> cfq:
>
> - sequential synchronous 4k write, iodepth=1 : 60 iops
> - sequential synchronous 4k write, iodepth=32 : 2000 iops
>
> noop:
>
> - sequential synchronous 4k write, iodepth=1 : 7866 iops
> - sequential synchronous 4k write, iodepth=32 : 34303 iops
>
>
> ----- Original Message -----
> From: "Somnath Roy" <somnath....@sandisk.com>
> To: "Jan Schermer" <j...@schermer.cz>, "aderumier" <aderum...@odiso.com>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Thursday, 9 July 2015 17:46:41
> Subject: RE: [ceph-users] Investigating my 100 IOPS limit
>
> I am not sure how increasing iodepth for sync writes is giving you a better
> result; the sync fio engine is supposed to always use iodepth=1.
> BTW, I faced similar issues some time back. Running the following fio job
> file, I was getting very dismal performance on my SSD on top of XFS:
>
> [random-write]
> directory=/mnt/fio_test
> rw=randwrite
> bs=16k
> direct=1
> sync=1
> time_based
> runtime=1m
> size=700G
> group_reporting
>
> Result:
> --------
> IOPS = 420
>
> lat (usec) : 250=0.10%, 500=2.28%, 750=22.25%, 1000=0.01%
> lat (msec) : 2=20.05%, 4=46.64%, 10=8.68%
>
>
> It turned out to be an SSD firmware problem. Some SSDs tend to misbehave in this
> pattern (even directly on the block device, without any XFS) because they don't
> handle O_DIRECT|O_SYNC writes well. I am sure you will find some references by
> digging into the Ceph mailing list. That's why not all SSDs behave well as a
> Ceph journal.
>
> Thanks & Regards
> Somnath
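As an aside, if the firmware angle needs chasing, the drive model and firmware revision can be read with smartctl from smartmontools (sdb again being just a placeholder for the device):

#smartctl -i /dev/sdb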
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
> Sent: Thursday, July 09, 2015 8:24 AM
> To: Alexandre DERUMIER
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>
> Those are very strange numbers. Is the “60” figure right?
>
> Can you paste the full fio command and output?
> Thanks
>
> Jan
>
>> On 09 Jul 2015, at 15:58, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>>
>> I just tried on an Intel S3700, on top of XFS.
>>
>> fio, with
>> - sequential synchronous 4k write, iodepth=1 : 60 iops
>> - sequential synchronous 4k write, iodepth=32 : 2000 iops
>> - random synchronous 4k write, iodepth=1 : 8000 iops
>> - random synchronous 4k write, iodepth=32 : 18000 iops
>>
>>
>> ----- Original Message -----
>> From: "aderumier" <aderum...@odiso.com>
>> To: "Jan Schermer" <j...@schermer.cz>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Thursday, 9 July 2015 15:50:35
>> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>>
>>>> Any ideas where to look? I was hoping blktrace would show what
>>>> exactly is going on, but it just shows a synchronous write -> (10ms)
>>>> -> completed
>>
>> What size is the write in this case? 4K, or more?
>>
>>
>> ----- Original Message -----
>> From: "Jan Schermer" <j...@schermer.cz>
>> To: "aderumier" <aderum...@odiso.com>
>> Cc: "ceph-users" <ceph-users@lists.ceph.com>
>> Sent: Thursday, 9 July 2015 15:29:15
>> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>>
>> I tried everything: --write-barrier, --sync, --fsync, --fdatasync. I never
>> get the same 10ms latency. It must be something special that the filesystem
>> journal/log does.
>>
>> Any ideas where to look? I was hoping blktrace would show what exactly
>> is going on, but it just shows a synchronous write -> (10ms) ->
>> completed
>>
>> Jan
>>
>>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>>>
>>>>> I have 12K IOPS in this test on the block device itself. But only
>>>>> 100 filesystem transactions (=IOPS) on a filesystem on the same
>>>>> device, because the “flush” (=FUA?) operation takes 10ms to finish.
>>>>> I just can’t replicate the same “flush” operation with fio on the
>>>>> block device, unfortunately, so I have no idea what is causing that
>>>>> :/
>>>
>>> AFAIK, fio on a block device with --sync=1 does a flush after each write.
>>>
>>> I'm not sure about fio on a filesystem, but the filesystem should do an
>>> fsync after each file write.
>>>
>>>
>>> ----- Original Message -----
>>> From: "Jan Schermer" <j...@schermer.cz>
>>> To: "aderumier" <aderum...@odiso.com>
>>> Cc: "ceph-users" <ceph-users@lists.ceph.com>
>>> Sent: Thursday, 9 July 2015 14:43:46
>>> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>>>
>>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5
>>> and higher have it for sure.
>>>
>>> I have 12K IOPS in this test on the block device itself. But only 100
>>> filesystem transactions (=IOPS) on a filesystem on the same device,
>>> because the “flush” (=FUA?) operation takes 10ms to finish. I just
>>> can’t replicate the same “flush” operation with fio on the block
>>> device, unfortunately, so I have no idea what is causing that :/
>>>
>>> Jan
>>>
>>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <aderum...@odiso.com> wrote:
>>>>
>>>> Hi,
>>>> I have already seen bad performance with a Crucial m550 SSD: 400 iops
>>>> for synchronous writes.
>>>>
>>>> Which model of SSD do you have?
>>>>
>>>> See this:
>>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>
>>>> What result do you get directly on the disk with
>>>>
>>>> #dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
>>>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
>>>> --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
>>>> --name=journal-test
>>>>
>>>> ?
>>>>
>>>> I'm using LSI 3008 controllers with Intel SSDs (3500, 3610, 3700) in
>>>> passthrough mode, and don't have any problem.
>>>>
>>>>
>>>> Also, about the CentOS 2.6.32 kernel: I'm not sure FUA support has been
>>>> backported by Red Hat (true FUA support arrived in 2.6.37), so maybe
>>>> it's the old barrier code.
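On the flush/FUA point, it can also be worth confirming whether the drive's volatile write cache is even enabled; on a stock setup that should be visible with something like the following (sdb being a placeholder for the device, and hdparm applying only to ATA/SATA drives):

#cat /sys/block/sdb/device/scsi_disk/*/cache_type
#hdparm -W /dev/sdb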
>>>> ----- Original Message -----
>>>> From: "Jan Schermer" <j...@schermer.cz>
>>>> To: "ceph-users" <ceph-users@lists.ceph.com>
>>>> Sent: Thursday, 9 July 2015 12:32:04
>>>> Subject: [ceph-users] Investigating my 100 IOPS limit
>>>>
>>>> I hope this will be interesting for some; it nearly cost me my sanity.
>>>>
>>>> Some time ago I came here with a problem manifesting as a “100 IOPS*”
>>>> limit with the LSI controllers and some drives.
>>>> It almost drove me crazy, because I could replicate the problem with ease,
>>>> but when I wanted to show it to someone it was often gone. Sometimes
>>>> it required fio to write for some time before the problem manifested
>>>> again, or seemingly conflicting settings to make it come up…
>>>>
>>>> Well, it turns out the problem is fio calling fallocate() when creating the
>>>> file to use for the test, which doesn’t really allocate the blocks, it
>>>> just “reserves” them.
>>>> When fio writes to those blocks, the filesystem journal becomes the
>>>> bottleneck (the 100 IOPS* limit can be seen there, with 100% utilization).
>>>>
>>>> If, however, I create the file with dd or such, those writes do _not_ end up
>>>> in the journal, and the result is 10K synchronous 4K IOPS on the same
>>>> drive.
>>>> If, for example, I run fio with a 1M block size, it still does 100*
>>>> IOPS, and when I then run a 4K block size test without deleting the file,
>>>> it runs at a 10K IOPS pace until it hits the first unwritten blocks;
>>>> then it slows to a crawl again.
>>>>
>>>> The same issue is present with XFS and ext3/ext4 (with default mount
>>>> options), and no matter how I create or mount the filesystem I cannot
>>>> avoid it. The only way around it is to mount ext4 with
>>>> -o journal_async_commit, which should be safe, but...
>>>>
>>>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), with LSI HBAs
>>>> and Kingston SSDs in this case (interestingly, this issue does not seem to
>>>> occur on Samsung SSDs!). I think it has something to do with LSI faking
>>>> “FUA” support for the drives (AFAIK they don’t support it, so the
>>>> controller must somehow flush the cache, which is what introduces the huge
>>>> latency hit).
>>>> I can’t replicate this problem on the block device itself, only on a file
>>>> on a filesystem, so it might as well be a kernel/driver bug. I have a
>>>> blktrace showing the difference between the “good” and “bad” writes, but I
>>>> don’t know what the driver/controller does; I only see the write on the
>>>> log device finishing after a long 10ms.
>>>>
>>>> Could someone tell me how Ceph creates its filesystem objects? I suppose
>>>> it does fallocate() as well, right? Is there any way to force it to write
>>>> them out completely instead, to get around this issue?
>>>>
>>>> How to replicate:
>>>>
>>>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write
>>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting
>>>> --name=journal-test --size=1000M --ioengine=libaio
>>>>
>>>>
>>>> * It is in fact 98 IOPS. Exactly. Not more, not less :-)
>>>>
>>>> Jan
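For the record, pre-creating the test file with dd (the workaround mentioned above) before running the same fio job would look something like this, with the path and size taken from the reproducer command:

#dd if=/dev/zero of=/mnt/something/testfile.fio bs=1M count=1000 oflag=direct

fio also has a fallocate= option controlling this pre-allocation step (fallocate=none disables the fallocate() call), which may be worth experimenting with as well.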
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com