Very nice. You're my hero!

Shinobu
----- Original Message -----
From: "GuangYang" <yguan...@outlook.com>
To: "Shinobu Kinjo" <ski...@redhat.com>
Cc: "Ben Hines" <bhi...@gmail.com>, "Nick Fisk" <n...@fisk.me.uk>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Saturday, September 5, 2015 9:40:06 AM
Subject: RE: [ceph-users] Ceph performance, empty vs part full

----------------------------------------
> Date: Fri, 4 Sep 2015 20:31:59 -0400
> From: ski...@redhat.com
> To: yguan...@outlook.com
> CC: bhi...@gmail.com; n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
>> IIRC, it only triggers the move (merge or split) when that folder is hit by
>> a request, so most likely it happens gradually.
>
> Do you know what causes this?

A request (read/write/setxattr, etc.) hitting objects in that folder.

> I would like to be clearer about "gradually".
>
> Shinobu
>
> ----- Original Message -----
> From: "GuangYang" <yguan...@outlook.com>
> To: "Ben Hines" <bhi...@gmail.com>, "Nick Fisk" <n...@fisk.me.uk>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Saturday, September 5, 2015 9:27:31 AM
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
> IIRC, it only triggers the move (merge or split) when that folder is hit by a
> request, so most likely it happens gradually.
>
> Another thing that might be helpful (and that we have had good experience
> with) is to do the folder splitting at pool creation time, so that we avoid
> the performance impact of runtime splitting (which is high if you have a
> large cluster). In order to do that:
>
> 1. Configure "filestore merge threshold" with a negative value so that
> merging is disabled.
> 2. When creating the pool, there is a parameter named "expected_num_objects";
> by specifying that number, the folders will be split to the right level at
> pool creation time.
>
> Hope that helps.
>
> Thanks,
> Guang
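For anyone wanting to try that, a minimal sketch of the pre-split setup. The
pool name, PG counts, ruleset name and object count below are placeholders, and
the exact argument order of "ceph osd pool create" varies between releases, so
check "ceph osd pool create -h" on your version first:

  # ceph.conf on the OSD hosts, set before the pool is created
  [osd]
      filestore merge threshold = -10    # any negative value disables merging

  # create the pool with an expected object count so the directory tree is
  # built up front instead of being split at runtime
  ceph osd pool create mypool 2048 2048 replicated replicated_ruleset 1000000000

With the expected object count supplied at creation time the OSDs build the
directory tree to its final depth immediately, so the splits never have to
happen while the pool is serving client I/O.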
>
> ----------------------------------------
>> From: bhi...@gmail.com
>> Date: Fri, 4 Sep 2015 12:05:26 -0700
>> To: n...@fisk.me.uk
>> CC: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>
>> Yeah, I'm not seeing stuff being moved at all. Perhaps we should file
>> a ticket to request a way to tell an OSD to rebalance its directory
>> structure.
>>
>> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>> I've just made the same change (4 and 40 for now) on my cluster, which is
>>> a similar size to yours. I didn't see any merging happening, although most
>>> of the directories I looked at had more files in them than the new merge
>>> threshold, so I guess this is to be expected.
>>>
>>> I'm currently splitting my PGs from 1024 to 2048 to see if that helps to
>>> bring things back into order.
>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Wang, Warren
>>>> Sent: 04 September 2015 01:21
>>>> To: Mark Nelson <mnel...@redhat.com>; Ben Hines <bhi...@gmail.com>
>>>> Cc: ceph-users <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>>
>>>> I'm about to change it on a big cluster too. It totals around 30 million,
>>>> so I'm a bit nervous about changing it. As far as I understood, it would
>>>> indeed move them around if you can get underneath the threshold, but that
>>>> may be hard to do. Two more settings that I highly recommend changing on
>>>> a big prod cluster. I'm in favor of bumping these two up in the defaults.
>>>>
>>>> Warren
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Mark Nelson
>>>> Sent: Thursday, September 03, 2015 6:04 PM
>>>> To: Ben Hines <bhi...@gmail.com>
>>>> Cc: ceph-users <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>>
>>>> Hrm, I think it will follow the merge/split rules if it's out of whack
>>>> given the new settings, but I don't know that I've ever tested it on an
>>>> existing cluster to see that it actually happens. I guess let it sit for
>>>> a while and then check the OSD PG directories to see if the object counts
>>>> make sense given the new settings? :D
>>>>
>>>> Mark
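One rough way to do that check from the shell, a sketch only; the OSD data path
and the PG id below are placeholders for your own:

  # count the files in each leaf directory of one PG on one OSD
  cd /var/lib/ceph/osd/ceph-12/current/3.1f_head
  find . -type d | while read d; do
      printf "%6d %s\n" "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"
  done | sort -rn | head

Directories still holding far more objects than the configured split point
would suggest the tree has not been restructured yet.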
>>>>
>>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
>>>>> Hey Mark,
>>>>>
>>>>> I've just tweaked these filestore settings for my cluster -- after
>>>>> changing this, is there a way to make ceph move existing objects
>>>>> around to new filestore locations, or will this only apply to newly
>>>>> created objects? (I would assume the latter..)
>>>>>
>>>>> thanks,
>>>>>
>>>>> -Ben
>>>>>
>>>>> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson <mnel...@redhat.com> wrote:
>>>>>> Basically for each PG, there's a directory tree where only a certain
>>>>>> number of objects are allowed in a given directory before it splits
>>>>>> into new branches/leaves. The problem is that this has a fair amount
>>>>>> of overhead, and there are also extra associated dentry lookups to get
>>>>>> at any given object.
>>>>>>
>>>>>> You may want to try something like:
>>>>>>
>>>>>> "filestore merge threshold = 40"
>>>>>> "filestore split multiple = 8"
>>>>>>
>>>>>> This will dramatically increase the number of objects allowed per
>>>>>> directory.
>>>>>>
>>>>>> Another thing you may want to try is telling the kernel to greatly
>>>>>> favor retaining dentries and inodes in cache:
>>>>>>
>>>>>> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>>>>>>
>>>>>> Mark
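Roughly how those two values combine, as far as I understand it; the factor of
16 is an internal FileStore constant, so treat the arithmetic as an
approximation, and the sysctl file name is just an example:

  # ceph.conf snippet -- restart the OSDs for it to take effect
  [osd]
      filestore merge threshold = 40
      filestore split multiple = 8

  # a directory is then allowed to hold roughly
  #     40 * 8 * 16 = 5120 objects
  # before it is split into subdirectories

  # make the dentry/inode cache preference survive a reboot
  echo "vm.vfs_cache_pressure = 1" | sudo tee /etc/sysctl.d/90-vfs-cache.conf
  sudo sysctl --system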
>>>>>>
>>>>>> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>>>>>>>
>>>>>>> If I create a new pool it is generally fast for a short amount of time.
>>>>>>> Not as fast as if I had a blank cluster, but close to it.
>>>>>>>
>>>>>>> Bryn
>>>>>>>>
>>>>>>>> On 8 Jul 2015, at 13:55, Gregory Farnum <g...@gregs42.com> wrote:
>>>>>>>>
>>>>>>>> I think you're probably running into the internal PG/collection
>>>>>>>> splitting here; try searching for those terms and seeing what your
>>>>>>>> OSD folder structures look like. You could test by creating a new
>>>>>>>> pool and seeing if it's faster or slower than the one you've already
>>>>>>>> filled up.
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>>>>>>>> <bryn.math...@alcatel-lucent.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I'm perf testing a cluster again.
>>>>>>>>> This time I have re-built the cluster and am filling it for testing.
>>>>>>>>>
>>>>>>>>> On a 10 min run I get the following results from 5 load generators,
>>>>>>>>> each writing through 7 iocontexts, with a queue depth of 50 async
>>>>>>>>> writes.
>>>>>>>>>
>>>>>>>>> Gen1
>>>>>>>>> Percentile 100 = 0.729775905609
>>>>>>>>> Max latencies = 0.729775905609, Min = 0.0320818424225, mean = 0.0750389684542
>>>>>>>>> Total objects written = 113088 in time 604.259738207s gives 187.151307376/s (748.605229503 MB/s)
>>>>>>>>>
>>>>>>>>> Gen2
>>>>>>>>> Percentile 100 = 0.735981941223
>>>>>>>>> Max latencies = 0.735981941223, Min = 0.0340068340302, mean = 0.0745198070711
>>>>>>>>> Total objects written = 113822 in time 604.437897921s gives 188.310495407/s (753.241981627 MB/s)
>>>>>>>>>
>>>>>>>>> Gen3
>>>>>>>>> Percentile 100 = 0.828994989395
>>>>>>>>> Max latencies = 0.828994989395, Min = 0.0349340438843, mean = 0.0745455575197
>>>>>>>>> Total objects written = 113670 in time 604.352181911s gives 188.085694736/s (752.342778944 MB/s)
>>>>>>>>>
>>>>>>>>> Gen4
>>>>>>>>> Percentile 100 = 1.06834602356
>>>>>>>>> Max latencies = 1.06834602356, Min = 0.0333499908447, mean = 0.0752239764659
>>>>>>>>> Total objects written = 112744 in time 604.408732891s gives 186.536020849/s (746.144083397 MB/s)
>>>>>>>>>
>>>>>>>>> Gen5
>>>>>>>>> Percentile 100 = 0.609658002853
>>>>>>>>> Max latencies = 0.609658002853, Min = 0.032968044281, mean = 0.0744482759499
>>>>>>>>> Total objects written = 113918 in time 604.671534061s gives 188.396498897/s (753.585995589 MB/s)
>>>>>>>>>
>>>>>>>>> Example ceph -w output:
>>>>>>>>> 2015-07-07 15:50:16.507084 mon.0 [INF] pgmap v1077: 2880 pgs: 2880 active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB avail; 2185 MB/s wr, 572 op/s
>>>>>>>>>
>>>>>>>>> However, when the cluster gets over 20% full I see the following
>>>>>>>>> results, and this gets worse as the cluster fills up:
>>>>>>>>>
>>>>>>>>> Gen1
>>>>>>>>> Percentile 100 = 6.71176099777
>>>>>>>>> Max latencies = 6.71176099777, Min = 0.0358741283417, mean = 0.161760483485
>>>>>>>>> Total objects written = 52196 in time 604.488474131s gives 86.347386648/s (345.389546592 MB/s)
>>>>>>>>>
>>>>>>>>> Gen2
>>>>>>>>> Max latencies = 4.09169006348, Min = 0.0357890129089, mean = 0.163243938477
>>>>>>>>> Total objects written = 51702 in time 604.036739111s gives 85.5941313704/s (342.376525482 MB/s)
>>>>>>>>>
>>>>>>>>> Gen3
>>>>>>>>> Percentile 100 = 7.32526683807
>>>>>>>>> Max latencies = 7.32526683807, Min = 0.0366668701172, mean = 0.163992217926
>>>>>>>>> Total objects written = 51476 in time 604.684302092s gives 85.1287189397/s (340.514875759 MB/s)
>>>>>>>>>
>>>>>>>>> Gen4
>>>>>>>>> Percentile 100 = 7.56094503403
>>>>>>>>> Max latencies = 7.56094503403, Min = 0.0355761051178, mean = 0.162109421231
>>>>>>>>> Total objects written = 52092 in time 604.769910812s gives 86.1352376642/s (344.540950657 MB/s)
>>>>>>>>>
>>>>>>>>> Gen5
>>>>>>>>> Percentile 100 = 6.99595499039
>>>>>>>>> Max latencies = 6.99595499039, Min = 0.0364680290222, mean = 0.163651215426
>>>>>>>>> Total objects written = 51566 in time 604.061977148s gives 85.3654127404/s (341.461650961 MB/s)
>>>>>>>>>
>>>>>>>>> Cluster details:
>>>>>>>>> 5 x HP DL380 with 13 x 6TB OSDs
>>>>>>>>> 128GB RAM
>>>>>>>>> 2 x Intel 2620v3
>>>>>>>>> 10Gbit Ceph public network
>>>>>>>>> 10Gbit Ceph private network
>>>>>>>>>
>>>>>>>>> Load generators connected via a 20Gbit bond to the ceph public network.
>>>>>>>>>
>>>>>>>>> Is this likely to be something happening to the journals?
>>>>>>>>> Or is there something else going on?
>>>>>>>>>
>>>>>>>>> I have run FIO and iperf tests and the disk and network performance
>>>>>>>>> is very high.
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Bryn Mathias

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com