Hello,

On Mon, 23 Jun 2014 10:26:32 -0700 Greg Poirier wrote:

> 10 OSDs per node
So 90 OSDs in total.

> 12 physical cores hyperthreaded (24 logical cores exposed to OS)
Sounds good.

> 64GB RAM
With SSDs the effect of a large pagecache on the storage nodes isn't as
pronounced, but it is still nice to have. ^^

> 
> Negligible load
> 
> iostat shows the disks are largely idle except for bursty writes
> occasionally.
> 
I suppose it is a bit of a drag to monitor this on 9 nodes at the same
time, but with atop it would at least be feasible.
You might want to check whether specific OSDs (both the disks and the
processes) are getting busy while others remain idle for the duration of
the test.
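
Something along these lines, run on each storage node while the test is
going, should show it. This is just a rough sketch -- it assumes the
sysstat package is installed and that sda through sdj are your OSD data
disks, so adjust device names and sample interval to your setup:

# per-disk utilisation, 5 second samples
iostat -x 5 sda sdb sdc sdd sde sdf sdg sdh sdi sdj

# per-process CPU usage of all ceph-osd daemons, 5 second samples
pidstat -u -p $(pgrep -d, -x ceph-osd) 5

"ceph osd perf" (if your Ceph version has it) also gives a quick
cluster-wide view of per-OSD commit and apply latencies.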

As for the fio results, could you try it from a VM using userspace RBD as
well? 
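
If setting up a VM is a hassle, recent fio builds also have a native rbd
ioengine that exercises librbd directly from the host. Roughly like this
(a sketch only -- it assumes a throwaway test image in the default 'rbd'
pool, the default admin keyring, and a fio compiled with rbd support,
which your 2.1.3 build may not have):

rbd create fiotest --size 4096
fio --name=fiojob --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fiotest --rw=randwrite --blocksize=4k --iodepth=128 \
    --numjobs=1

Comparing that against the kernel RBD number would tell us whether the
krbd client is part of the problem.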

Either way, that result from the host is horrible, but unfortunately it
is totally on par with what you saw from your dd test.
From my experience I would have expected a cluster like yours to produce
up to 40k IOPS (yes, roughly what a single one of your SSDs can do).

Something more than the inherent latency of Ceph (OSDs) seems to be going
on here.
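
A quick back-of-the-envelope check on the numbers below: with iodepth=128
and an average completion latency of about 55 ms, you get

  128 in-flight IOs / 0.055 s  ~=  2300 IOPS

which is exactly what fio reports. So the run is purely latency-bound,
and 55 ms per 4k write is a long way from the 10-15 ms round-trip you
measured with dump_historic_ops -- the difference presumably being spent
queueing somewhere between the client and the OSDs.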

Christian

> Results of fio from one of the SSDs in the cluster:
> 
> fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> iodepth=128
> fio-2.1.3
> Starting 1 process
> fiojob: Laying out IO file(s) (1 file(s) / 400MB)
> Jobs: 1 (f=1): [w] [-.-% done] [0KB/155.5MB/0KB /s] [0/39.8K/0 iops] [eta
> 00m:00s]
> fiojob: (groupid=0, jobs=1): err= 0: pid=21845: Mon Jun 23 13:23:47 2014
>   write: io=409600KB, bw=157599KB/s, iops=39399, runt=  2599msec
>     slat (usec): min=6, max=2149, avg=22.13, stdev=23.08
>     clat (usec): min=70, max=10700, avg=3220.76, stdev=521.44
>      lat (usec): min=90, max=10722, avg=3243.13, stdev=523.70
>     clat percentiles (usec):
>      |  1.00th=[ 2736],  5.00th=[ 2864], 10.00th=[ 2896],
> 20.00th=[ 2928], | 30.00th=[ 2960], 40.00th=[ 3024], 50.00th=[ 3056],
> 60.00th=[ 3184], | 70.00th=[ 3344], 80.00th=[ 3440], 90.00th=[ 3504],
> 95.00th=[ 3632], | 99.00th=[ 5856], 99.50th=[ 6240], 99.90th=[ 7136],
> 99.95th=[ 7584], | 99.99th=[ 8160]
>     bw (KB  /s): min=139480, max=173320, per=99.99%, avg=157577.60,
> stdev=16122.77
>     lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.08%, 4=95.89%, 10=3.98%, 20=0.01%
>   cpu          : usr=14.05%, sys=46.73%, ctx=72243, majf=0, minf=186
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
> >=64=99.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.1%
>      issued    : total=r=0/w=102400/d=0, short=r=0/w=0/d=0
> 
> Run status group 0 (all jobs):
>   WRITE: io=409600KB, aggrb=157599KB/s, minb=157599KB/s, maxb=157599KB/s,
> mint=2599msec, maxt=2599msec
> 
> Disk stats (read/write):
>   sda: ios=0/95026, merge=0/0, ticks=0/3016, in_queue=2972, util=82.27%
> 
> All of the disks are identical.
> 
> The same fio from the host with the RBD volume mounted:
> 
> fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> iodepth=128
> fio-2.1.3
> Starting 1 process
> fiojob: Laying out IO file(s) (1 file(s) / 400MB)
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/5384KB/0KB /s] [0/1346/0 iops] [eta
> 00m:00s]
> fiojob: (groupid=0, jobs=1): err= 0: pid=30070: Mon Jun 23 13:25:50 2014
>   write: io=409600KB, bw=9264.3KB/s, iops=2316, runt= 44213msec
>     slat (usec): min=17, max=154210, avg=84.83, stdev=535.40
>     clat (msec): min=10, max=1294, avg=55.17, stdev=103.43
>      lat (msec): min=10, max=1295, avg=55.25, stdev=103.43
>     clat percentiles (msec):
>      |  1.00th=[   17],  5.00th=[   21], 10.00th=[   24],
> 20.00th=[   28], | 30.00th=[   31], 40.00th=[   34], 50.00th=[   37],
> 60.00th=[   40], | 70.00th=[   44], 80.00th=[   50], 90.00th=[   63],
> 95.00th=[  103], | 99.00th=[  725], 99.50th=[  906], 99.90th=[ 1106],
> 99.95th=[ 1172], | 99.99th=[ 1237]
>     bw (KB  /s): min= 3857, max=12416, per=100.00%, avg=9280.09,
> stdev=1233.63
>     lat (msec) : 20=3.76%, 50=76.60%, 100=14.45%, 250=2.98%, 500=0.72%
>     lat (msec) : 750=0.56%, 1000=0.66%, 2000=0.27%
>   cpu          : usr=3.50%, sys=19.31%, ctx=131358, majf=0, minf=986
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
> >=64=99.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.1%
>      issued    : total=r=0/w=102400/d=0, short=r=0/w=0/d=0
> 
> Run status group 0 (all jobs):
>   WRITE: io=409600KB, aggrb=9264KB/s, minb=9264KB/s, maxb=9264KB/s,
> mint=44213msec, maxt=44213msec
> 
> Disk stats (read/write):
>   rbd2: ios=0/102499, merge=0/1818, ticks=0/5593828, in_queue=5599520,
> util=99.85%
> 
> 
> On Sun, Jun 22, 2014 at 6:42 PM, Christian Balzer <ch...@gol.com> wrote:
> 
> > On Sun, 22 Jun 2014 12:14:38 -0700 Greg Poirier wrote:
> >
> > > We actually do have a use pattern of large batch sequential writes,
> > > and this dd is pretty similar to that use case.
> > >
> > > A round-trip write with replication takes approximately 10-15ms to
> > > complete. I've been looking at dump_historic_ops on a number of OSDs
> > > and getting mean, min, and max for sub_op and ops. If these were on
> > > the order of 1-2 seconds, I could understand this throughput... But
> > > we're talking about fairly fast SSDs and a 20Gbps network with <1ms
> > > latency for TCP round-trip between the client machine and all of the
> > > OSD hosts.
> > >
> > > I've gone so far as disabling replication entirely (which had almost
> > > no impact) and putting journals on separate SSDs as the data disks
> > > (which are ALSO SSDs).
> > >
> > > This just doesn't make sense to me.
> > >
> > A lot of this sounds like my "Slow IOPS on RBD compared to journal and
> > backing devices" thread a few weeks ago.
> > Though those results are even worse in a way than what I saw.
> >
> > How many OSDs do you have per node and how many CPU cores?
> >
> > When running this test, are the OSDs very CPU intense?
> > Do you see a good spread amongst the OSDs or are there hotspots?
> >
> > If you have the time/chance, could you run the fio from that thread and
> > post the results, I'm very curious to find out if the no more than 400
> > IOPS per OSD holds true for your cluster as well.
> >
> > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> > --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
> >
> >
> > Regards,
> >
> > Christian
> >
> > >
> > > On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson
> > > <mark.nel...@inktank.com> wrote:
> > >
> > > > On 06/22/2014 02:02 AM, Haomai Wang wrote:
> > > >
> > > >> Hi Mark,
> > > >>
> > > >> Do you enable rbdcache? I test on my ssd cluster(only one ssd), it
> > > >> seemed ok.
> > > >>
> > > >>  dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
> > > >>>
> > > >>
> > > >> 82.3MB/s
> > > >>
> > > >
> > > > RBD Cache is definitely going to help in this use case.  This test
> > > > is basically just sequentially writing a single 16k chunk of data
> > > > out, one at a time.  IE, entirely latency bound.  At least on OSDs
> > > > backed by XFS, you have to wait for that data to hit the journals
> > > > of every OSD associated with the object before the acknowledgement
> > > > gets sent back to the client.  If you are using the default 4MB
> > > > block size, you'll hit the same OSDs over and over again and your
> > > > other OSDs will sit there twiddling their thumbs waiting for IO
> > > > until you hit the next block, but then it will just be a different
> > > > set of OSDs getting hit.  You should be able to verify this by using
> > > > iostat or collectl or something to look at the behaviour of the
> > > > SSDs during the test.  Since this is all sequential though,
> > > > switching to buffered IO (i.e. coalesce IOs at the buffercache
> > > > layer) or using RBD cache for direct IO (coalesce IOs below the
> > > > block device) will dramatically improve things.
> > > >
> > > > The real question here though, is whether or not a synchronous
> > > > stream of sequential 16k writes is even remotely close to the IO
> > > > patterns that would be seen in actual use for MySQL.  Most likely
> > > > in actual use you'll be seeing a big mix of random reads and
> > > > writes, and hopefully at least some parallelism (though this
> > > > depends on the number of databases, number of users, and the
> > > > workload!).
> > > >
> > > > Ceph is pretty good at small random IO with lots of parallelism on
> > > > spinning disk backed OSDs (So long as you use SSD journals or
> > > > controllers with WB cache).  It's much harder to get native-level
> > > > IOPS rates with SSD backed OSDs though.  The latency involved in
> > > > distributing and processing all of that data becomes a much bigger
> > > > deal.  Having said that, we are actively working on improving
> > > > latency as much as we can. :)
> > > >
> > > > Mark
> > > >
> > > >
> > > >
> > > >>
> > > >> On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood
> > > >> <mark.kirkw...@catalyst.net.nz> wrote:
> > > >>
> > > >>> On 22/06/14 14:09, Mark Kirkwood wrote:
> > > >>>
> > > >>> Upgrading the VM to 14.04 and retesting the case *without*
> > > >>> direct I get:
> > > >>>
> > > >>> - 164 MB/s (librbd)
> > > >>> - 115 MB/s (kernel 3.13)
> > > >>>
> > > >>> So managing to almost get native performance out of the librbd
> > > >>> case. I tweaked both filestore max and min sync intervals (100
> > > >>> and 10 resp) just to
> > > >>> see if I could actually avoid writing to the spinners while the
> > > >>> test was in
> > > >>> progress (still seeing some, but clearly fewer).
> > > >>>
> > > >>> However no improvement at all *with* direct enabled. The output
> > > >>> of iostat on
> > > >>> the host while the direct test is in progress is interesting:
> > > >>>
> > > >>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > > >>>            11.73    0.00    5.04    0.76    0.00   82.47
> > > >>>
> > > >>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
> > > >>> avgrq-sz
> > > >>> avgqu-sz   await r_await w_await  svctm  %util
> > > >>> sda               0.00     0.00    0.00   11.00     0.00     4.02
> > > >>> 749.09 0.14   12.36    0.00   12.36   6.55   7.20
> > > >>> sdb               0.00     0.00    0.00   11.00     0.00     4.02
> > > >>> 749.09 0.14   12.36    0.00   12.36   5.82   6.40
> > > >>> sdc               0.00     0.00    0.00  435.00     0.00     4.29
> > > >>> 20.21 0.53    1.21    0.00    1.21   1.21  52.80
> > > >>> sdd               0.00     0.00    0.00  435.00     0.00     4.29
> > > >>> 20.21 0.52    1.20    0.00    1.20   1.20  52.40
> > > >>>
> > > >>> (sda,b are the spinners sdc,d the ssds). Something is making the
> > > >>> journal work very hard for its 4.29 MB/s!
> > > >>>
> > > >>> regards
> > > >>>
> > > >>> Mark
> > > >>>
> > > >>>
> > > >>>  Leaving
> > > >>>> off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s
> > > >>>> (kernel 3.11 [2]). The ssd's can do writes at about 180 MB/s
> > > >>>> each... which is something to look at another day[1].
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
