On Tue, 8 Apr 2014 09:35:19 -0700 Gregory Farnum wrote:

> On Tuesday, April 8, 2014, Christian Balzer <ch...@gol.com> wrote:
> 
> > On Tue, 08 Apr 2014 14:19:20 +0200 Josef Johansson wrote:
> > 
> > > On 08/04/14 10:39, Christian Balzer wrote:
> > > > On Tue, 08 Apr 2014 10:31:44 +0200 Josef Johansson wrote:
> > > > 
> > > >> On 08/04/14 10:04, Christian Balzer wrote:
> > > >>> Hello,
> > > >>> 
> > > >>> On Tue, 08 Apr 2014 09:31:18 +0200 Josef Johansson wrote:
> > > >>> 
> > > >>>> Hi all,
> > > >>>> 
> > > >>>> I am currently benchmarking a standard setup with Intel DC S3700
> > > >>>> disks as journals and Hitachi 4TB disks as data drives, all on a
> > > >>>> LACP 10GbE network.
> > > >>>> 
> > > >>> Unless that is the 400GB version of the DC S3700, you're already
> > > >>> limiting yourself to 365MB/s throughput with the 200GB variant.
> > > >>> That only matters if sequential write speed is that important to
> > > >>> you and you think you'll ever get those 5 HDs to write at full
> > > >>> speed with Ceph (unlikely).
> > > >> It's the 400GB version of the DC S3700, and yes, I'm aware that I
> > > >> need a 1:3 ratio to max out these disks, as they write sequential
> > > >> data at about 150MB/s.
> > > >> But our thinking is that a 1:5 ratio would cover the current
> > > >> demand, and we could upgrade later.
> > > > I'd reckon you'll do fine, as in run out of steam and IOPS before
> > > > hitting that limit.
> > > > 
> > > >>>> The size of my journals is 25GB each, and I have two journals
> > > >>>> per machine, with 5 OSDs per journal and 5 machines in total.
> > > >>>> We currently use the optimal tunables, and the version of Ceph
> > > >>>> is the latest dumpling.
> > > >>>> 
> > > >>>> Benchmarking writes with rbd shows that there's no problem
> > > >>>> hitting upper levels on the 4TB disks with sequential data,
> > > >>>> thus maxing out 10GbE. At this moment we see full utilization
> > > >>>> on the journals.
> > > >>>> 
> > > >>>> But lowering the byte size to 4k shows that the journals are
> > > >>>> utilized to about 20%, and the 4TB disks to 100%.
> > > >>>> (rados -p <pool> -b 4096 -t 256 bench 100 write)
> > > >>>> 
> > > >>> When you say utilization I assume you're talking about iostat
> > > >>> or atop output?
> > > >> Yes, the utilization is from iostat.
> > > >>> That's not a bug, that's comparing apples to oranges.
> > > >> You mean comparing iostat results with the ones from the rados
> > > >> benchmark?
> > > >>> The rados bench default is 4MB, which not only happens to be
> > > >>> the default RBD object size but also generates a nice amount of
> > > >>> bandwidth.
> > > >>> 
> > > >>> While at 4k writes your SSD is obviously bored, the actual OSD
> > > >>> needs to handle all those writes and becomes limited by the
> > > >>> IOPS it can perform.
> > > >> Yes, it's quite bored and just shuffles data.
> > > >> Maybe I've been thinking about this the wrong way, but shouldn't
> > > >> the journal buffer more, until the journal partition is full or
> > > >> the flush interval is met?
> > > >> 
> > > > Take a look at "journal queue max ops", which has a default of a
> > > > mere 500, so that's full after 2 seconds. ^o^
> > > Hm, that makes sense.
> > > 
> > > So, I tested out both a low value (5000) and a large value
> > > (6553600), but it didn't seem to change anything.
> > > Any chance I could dump the current values from a running OSD, to
> > > actually see what is in use?
> > > 
> > The values can be checked like this (for example):
> > ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
> > 
> > If you restarted your OSD after updating ceph.conf I'm sure you will
> > find the values you set there.
> > 
> > However, you are seriously underestimating the packet storm you're
> > unleashing with 256 threads of 4KB writes over a 10Gb/s link.
> > 
> > That's theoretically 256K packets/s, very quickly filling even your
> > "large" max ops setting.
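The packet-storm point can be put into numbers. A back-of-the-envelope sketch: the line rate is straight arithmetic from 4K writes over 10GbE (ignoring all TCP/Ceph framing overhead, which is why it comes out somewhat above the 256K packets/s quoted), while the observed IOPS figure is an assumption read off the first second of the tuned bench run quoted in this thread (~2700 finished ops):

```python
# Rough arithmetic behind the "packet storm": how quickly 4K writes
# can fill the default journal queues.

LINK_BITS_PER_SEC = 10e9               # 10GbE line rate
WRITE_SIZE = 4096                      # 4K writes
JOURNAL_QUEUE_MAX_OPS = 500            # Ceph default, per this thread
JOURNAL_QUEUE_MAX_BYTES = 33_554_432   # Ceph default (32MB), per this thread
OBSERVED_IOPS = 2700                   # assumption: second 1 of the bench run

# Ceiling: 4K writes arriving at line rate (no protocol overhead counted).
line_rate_iops = LINK_BITS_PER_SEC / 8 / WRITE_SIZE   # ~305,000 ops/s

# Time to fill the default queues at line rate and at the observed rate.
ops_fill_line = JOURNAL_QUEUE_MAX_OPS / line_rate_iops       # ~1.6 ms
ops_fill_observed = JOURNAL_QUEUE_MAX_OPS / OBSERVED_IOPS    # ~0.19 s
bytes_fill_line = JOURNAL_QUEUE_MAX_BYTES / (line_rate_iops * WRITE_SIZE)

print(f"line-rate 4K IOPS:       {line_rate_iops:,.0f}")
print(f"500-op queue full after: {ops_fill_line * 1000:.1f} ms (line rate)")
print(f"500-op queue full after: {ops_fill_observed:.2f} s (observed IOPS)")
print(f"32MB byte queue full in: {bytes_fill_line * 1000:.0f} ms (line rate)")
```

Either way the default queues fill in well under a second, which matches the "full after 2 seconds" observation above.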
> > Also, "journal max write entries" will need to be adjusted to suit
> > the abilities (speed- and merge-wise) of your OSDs.
> > 
> > With 40 million max ops and 2048 max write entries I get this
> > (instead of values similar to yours with the defaults):
> > 
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >     1     256      2963      2707   10.5707   10.5742  0.125177 0.0830565
> >     2     256      5278      5022   9.80635   9.04297  0.247757 0.0968146
> >     3     256      7276      7020   9.13867   7.80469  0.002813 0.0994022
> >     4     256      8774      8518   8.31665   5.85156  0.002976  0.107339
> >     5     256     10121      9865   7.70548   5.26172  0.002569  0.117767
> >     6     256     11363     11107   7.22969   4.85156   0.38909  0.130649
> >     7     256     12354     12098    6.7498   3.87109  0.002857  0.137199
> >     8     256     12392     12136   5.92465  0.148438   1.15075  0.138359
> >     9     256     12551     12295   5.33538  0.621094  0.003575  0.151978
> >    10     256     13099     12843    5.0159   2.14062  0.146283   0.17639
> > 
> > Of course this tails off eventually, but the effect is obvious and
> > the bandwidth is double that of the default values.
> > 
> > I'm sure some Inktank person will pipe up momentarily as to why
> > these defaults were chosen and why such huge values are to be
> > avoided. ^.-
> > 
> Just from skimming, those numbers do look a little low, but I'm not
> sure how all the latencies work out.
> 
> Anyway, the reason we chose the low numbers is to avoid overloading a
> backing hard drive, which is going to have a lot more trouble than the
> journal with a huge backlog of ops. You'll want to test your small IO
> results for a very long time/with a fairly small journal to check that
> you don't get a square wave of throughput when waiting for the backing
> disk to commit everything to disk.
> 
I assume that's the same reason for the default values of these
parameters?
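For reference, the journal knobs traded back and forth in this thread are set in the [osd] section of ceph.conf. A sketch using the experimental values quoted above, purely as an illustration and not a recommendation, given the caution about overloading the backing disks:

```ini
[osd]
; experimental values from this thread -- illustration only, NOT a
; recommendation (large backlogs can swamp the backing hard drives)
journal queue max ops     = 40000000
journal max write entries = 2048

; defaults quoted in this thread, for comparison:
;   journal queue max ops   = 500
;   journal max write bytes = 10485760   ; 10MB
;   journal queue max bytes = 33554432   ; 32MB
```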
"journal_max_write_bytes": "10485760",
"journal_queue_max_bytes": "33554432",

A mere 10 and 32MB.
According to the documentation I read this as: no more than 10MB per
write to the filestore, and no more than 32MB in the queue, ever.

Is the queue the entire journal, or a per-client/connection thing?
If it is the entire journal, why do people use 10GB or, in my case,
40GB journals? ^o^

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com