On Tue, 8 Apr 2014 09:35:19 -0700 Gregory Farnum wrote:

> On Tuesday, April 8, 2014, Christian Balzer <ch...@gol.com> wrote:
> 
> > On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
> > >
> > > On 08/04/14 10:39, Christian Balzer wrote:
> > > > On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
> > > >
> > > >> On 08/04/14 10:04, Christian Balzer wrote:
> > > >>> Hello,
> > > >>>
> > > >>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
> > > >>>
> > > >>>> Hi all,
> > > >>>>
> > > >>>> I am currently benchmarking a standard setup with Intel DC S3700
> > > >>>> disks as journals and Hitachi 4TB-disks as data-drives, all on
> > > >>>> an LACP 10GbE network.
> > > >>>>
> > > >>> Unless that is the 400GB version of the DC S3700, you're already
> > > >>> limiting yourself to 365MB/s throughput with the 200GB variant.
> > > >>> That only matters if sequential write speed is that important to
> > > >>> you and you think you'll ever get those 5 HDs to write at full
> > > >>> speed with Ceph (unlikely).
> > > >> It's the 400GB version of the DC S3700, and yes, I'm aware that I
> > > >> need a 1:3 ratio to max out these SSDs, as the data drives write
> > > >> sequential data at about 150MB/s.
> > > >> Our thinking is that a 1:5 ratio will cover the current demand,
> > > >> and we can upgrade later.
> > > > I'd reckon you'll do fine, as in run out of steam and IOPS before
> > > > hitting that limit.
> > > >
> > > >>>> My journals are 25GB each, and I have two journals per machine,
> > > >>>> with 5 OSDs per journal, and 5 machines in total.
> > > >>>> We currently use the "optimal" crush tunables, and the Ceph
> > > >>>> version is the latest dumpling.
> > > >>>>
> > > >>>> Benchmarking writes with rbd shows that there's no problem
> > > >>>> hitting the upper limits of the 4TB-disks with sequential data,
> > > >>>> thus maxing out the 10GbE. At that point we see full utilization
> > > >>>> on the journals.
> > > >>>>
> > > >>>> But lowering the byte-size to 4k shows that the journals are
> > > >>>> utilized to about 20%, and the 4TB-disks 100%. (rados -p <pool>
> > > >>>> -b 4096 -t 256 100 write)
> > > >>>>
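For reference, spelled out in full that benchmark invocation is presumably
something along these lines, with <pool> left as a placeholder:

    # 100-second write test with 4KB objects and 256 concurrent ops
    rados -p <pool> bench 100 write -b 4096 -t 256
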
> > > >>> When you're saying utilization I assume you're talking about
> > > >>> iostat or atop output?
> > > >> Yes, the utilization is iostat.
> > > >>> That's not a bug, that's comparing apples to oranges.
> > > >> You mean comparing iostat-results with the ones from rados
> > > >> benchmark?
> > > >>> The rados bench default is 4MB, which not only happens to be the
> > > >>> default RBD object size but also generates a nice amount of
> > > >>> bandwidth.
> > > >>>
> > > >>> At 4k writes your SSD is obviously bored, but the actual OSD
> > > >>> needs to handle all those writes and becomes limited by the IOPS
> > > >>> it can perform.
> > > >> Yes, it's quite bored and just shuffles data.
> > > >> Maybe I've been thinking about this the wrong way,
> > > >> but shouldn't the journal buffer more, until the journal partition
> > > >> is full or the flush_interval is reached?
> > > >>
> > > > Take a look at "journal queue max ops", which has a default of a
> > > > mere 500, so that's full after 2 seconds. ^o^
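A minimal ceph.conf sketch of where that throttle lives; the value here is
purely illustrative, not a recommendation:

    [osd]
        ; maximum ops allowed in the journal queue (default: 500)
        journal queue max ops = 5000
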
> > > Hm, that makes sense.
> > >
> > > So, I tested out both a low value (5000) and a large value (6553600),
> > > but it didn't seem to change anything.
> > > Any chance I could dump the current values from a running OSD, to
> > > actually see what is in use?
> > >
> > The value can be checked like this (for example):
> > ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
> >
> > If you restarted your OSD after updating ceph.conf I'm sure you will
> > find the values you set there.
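To narrow that output down to the journal settings specifically, something
along the lines of:

    ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep journal
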
> >
> > However you are seriously underestimating the packet storm you're
> > unleashing with 256 threads of 4KB packets over a 10Gb/s link.
> >
> > That's theoretically 256K packets/s, very quickly filling even your
> > "large" max ops setting.
> > Also the "journal max write entries" will need to be adjusted to suit
> > the abilities (speed and merge wise) of your OSDs.
> >
> > With 40 million max ops and 2048 max write I get this (instead of
> > similar values to you with the defaults):
> >
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >     1     256      2963      2707   10.5707   10.5742  0.125177  0.0830565
> >     2     256      5278      5022   9.80635   9.04297  0.247757  0.0968146
> >     3     256      7276      7020   9.13867   7.80469  0.002813  0.0994022
> >     4     256      8774      8518   8.31665   5.85156  0.002976   0.107339
> >     5     256     10121      9865   7.70548   5.26172  0.002569   0.117767
> >     6     256     11363     11107   7.22969   4.85156   0.38909   0.130649
> >     7     256     12354     12098    6.7498   3.87109  0.002857   0.137199
> >     8     256     12392     12136   5.92465  0.148438   1.15075   0.138359
> >     9     256     12551     12295   5.33538  0.621094  0.003575   0.151978
> >    10     256     13099     12843    5.0159   2.14062  0.146283    0.17639
> >
> > Of course this tails off eventually, but the effect is obvious and the
> > bandwidth is double that of the default values.
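Expressed as a ceph.conf fragment, the settings used for that run would
look roughly like this (the values quoted above, not a general
recommendation):

    [osd]
        journal queue max ops = 40000000
        journal max write entries = 2048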
> >
> > I'm sure some Inktank person will pipe up momentarily as to why these
> > defaults were chosen and why such huge values are to be avoided. ^.-
> >
> 
> Just from skimming, those numbers do look a little low, but I'm not sure
> how all the latencies work out.
> 
> Anyway, the reason we chose the low numbers is to avoid overloading a
> backing hard drive, which is going to have a lot more trouble than the
> journal with a huge backlog of ops. You'll want to test your small-IO
> workload for a very long time, or with a fairly small journal, to check
> that you don't get a square wave of throughput while waiting for the
> backing disk to commit everything.
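In other words, presumably something like a long small-block run against a
deliberately small journal, for example:

    # hypothetical hour-long 4KB write test to expose journal-flush stalls
    rados -p <pool> bench 3600 write -b 4096 -t 256
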
> 
I assume that's the same reason for the default values of these parameters?

  "journal_max_write_bytes": "10485760",
  "journal_queue_max_bytes": "33554432",

A mere 10 and 32MB.

According to the documentation, I read this as no more than 10MB per write
to the filestore and no more than 32MB in the queue, ever.
Is the queue the entire journal, or a per-client/connection thing?

If it's the entire journal, why do people use 10GB, or in my case 40GB,
journals? ^o^
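
If those byte limits are indeed the other half of the throttle, presumably
they would need to be scaled up alongside the op limits, e.g. something
like (values purely illustrative):

    [osd]
        ; illustrative only: ~100MB per write limit, ~1GB of queue
        journal max write bytes = 104857600
        journal queue max bytes = 1073741824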

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/