Very nice. You're my hero!

Shinobu
----- Original Message -----
From: "GuangYang" <yguan...@outlook.com>
To: "Shinobu Kinjo" <ski...@redhat.com>
Cc: "Ben Hines" <bhi...@gmail.com>, "Nick Fisk" <n...@fisk.me.uk>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Saturday, September 5, 2015 9:40:06 AM
Subject: RE: [ceph-users] Ceph performance, empty vs part full

----------------------------------------
> Date: Fri, 4 Sep 2015 20:31:59 -0400
> From: ski...@redhat.com
> To: yguan...@outlook.com
> CC: bhi...@gmail.com; n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
>> IIRC, it only triggers the move (merge or split) when that folder is hit by
>> a request, so most likely it happens gradually.
>
> Do you know what causes this?

A request (read/write/setxattr, etc.) hitting objects in that folder.

> I would like to be clearer about "gradually".
>
> Shinobu
>
> ----- Original Message -----
> From: "GuangYang" <yguan...@outlook.com>
> To: "Ben Hines" <bhi...@gmail.com>, "Nick Fisk" <n...@fisk.me.uk>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Saturday, September 5, 2015 9:27:31 AM
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
> IIRC, it only triggers the move (merge or split) when that folder is hit by a
> request, so most likely it happens gradually.
>
> Another thing that might be helpful (and that we have had good experience
> with) is to do the folder splitting at pool creation time, so that we avoid
> the performance impact of runtime splitting (which is high if you have a
> large cluster). In order to do that:
>
> 1. Configure "filestore merge threshold" with a negative value so that
> merging is disabled.
> 2. When creating the pool, there is a parameter named "expected_num_objects";
> by specifying that number, the folders will be split to the right level at
> pool creation time.
>
> Hope that helps.
>
> Thanks,
> Guang
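For anyone wanting to try that, a minimal sketch of the pre-split setup. The
pool name, PG counts, ruleset name and object count below are placeholders, and
the exact argument order of "ceph osd pool create" varies between releases, so
check "ceph osd pool create -h" on your version first:

  # ceph.conf on the OSD hosts, set before the pool is created
  [osd]
      filestore merge threshold = -10    # any negative value disables merging

  # create the pool with an expected object count so the directory tree is
  # built up front instead of being split at runtime
  ceph osd pool create mypool 2048 2048 replicated replicated_ruleset 1000000000

With the expected object count supplied at creation time the OSDs build the
directory tree to its final depth immediately, so the splits never have to
happen while the pool is serving client I/O.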
>
> ----------------------------------------
>> From: bhi...@gmail.com
>> Date: Fri, 4 Sep 2015 12:05:26 -0700
>> To: n...@fisk.me.uk
>> CC: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>
>> Yeah, I'm not seeing stuff being moved at all. Perhaps we should file
>> a ticket to request a way to tell an OSD to rebalance its directory
>> structure.
>>
>> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk <n...@fisk.me.uk> wrote:
>>> I've just made the same change (4 and 40 for now) on my cluster, which is
>>> a similar size to yours. I didn't see any merging happening, although most
>>> of the directories I looked at had more files in them than the new merge
>>> threshold, so I guess this is to be expected.
>>>
>>> I'm currently splitting my PGs from 1024 to 2048 to see if that helps to
>>> bring things back into order.
>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Wang, Warren
>>>> Sent: 04 September 2015 01:21
>>>> To: Mark Nelson <mnel...@redhat.com>; Ben Hines <bhi...@gmail.com>
>>>> Cc: ceph-users <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>>
>>>> I'm about to change it on a big cluster too. It totals around 30 million,
>>>> so I'm a bit nervous about changing it. As far as I understood, it would
>>>> indeed move them around if you can get underneath the threshold, but that
>>>> may be hard to do. Two more settings that I highly recommend changing on
>>>> a big prod cluster. I'm in favor of bumping these two up in the defaults.
>>>>
>>>> Warren
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Mark Nelson
>>>> Sent: Thursday, September 03, 2015 6:04 PM
>>>> To: Ben Hines <bhi...@gmail.com>
>>>> Cc: ceph-users <ceph-users@lists.ceph.com>
>>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>>
>>>> Hrm, I think it will follow the merge/split rules if it's out of whack
>>>> given the new settings, but I don't know that I've ever tested it on an
>>>> existing cluster to see that it actually happens. I guess let it sit for
>>>> a while and then check the OSD PG directories to see if the object counts
>>>> make sense given the new settings? :D
>>>>
>>>> Mark
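One rough way to do that check from the shell, a sketch only; the OSD data path
and the PG id below are placeholders for your own:

  # count the files in each leaf directory of one PG on one OSD
  cd /var/lib/ceph/osd/ceph-12/current/3.1f_head
  find . -type d | while read d; do
      printf "%6d %s\n" "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"
  done | sort -rn | head

Directories still holding far more objects than the configured split point
would suggest the tree has not been restructured yet.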
>>>>
>>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
>>>>> Hey Mark,
>>>>>
>>>>> I've just tweaked these filestore settings for my cluster -- after
>>>>> changing this, is there a way to make ceph move existing objects
>>>>> around to new filestore locations, or will this only apply to newly
>>>>> created objects? (I would assume the latter..)
>>>>>
>>>>> thanks,
>>>>>
>>>>> -Ben
>>>>>
>>>>> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson <mnel...@redhat.com> wrote:
>>>>>> Basically for each PG, there's a directory tree where only a certain
>>>>>> number of objects are allowed in a given directory before it splits
>>>>>> into new branches/leaves. The problem is that this has a fair amount
>>>>>> of overhead, and there are also extra associated dentry lookups to get
>>>>>> at any given object.
>>>>>>
>>>>>> You may want to try something like:
>>>>>>
>>>>>> "filestore merge threshold = 40"
>>>>>> "filestore split multiple = 8"
>>>>>>
>>>>>> This will dramatically increase the number of objects allowed per
>>>>>> directory.
>>>>>>
>>>>>> Another thing you may want to try is telling the kernel to greatly
>>>>>> favor retaining dentries and inodes in cache:
>>>>>>
>>>>>> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>>>>>>
>>>>>> Mark
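Roughly how those two values combine, as far as I understand it; the factor of
16 is an internal FileStore constant, so treat the arithmetic as an
approximation, and the sysctl file name is just an example:

  # ceph.conf snippet -- restart the OSDs for it to take effect
  [osd]
      filestore merge threshold = 40
      filestore split multiple = 8

  # a directory is then allowed to hold roughly
  #     40 * 8 * 16 = 5120 objects
  # before it is split into subdirectories

  # make the dentry/inode cache preference survive a reboot
  echo "vm.vfs_cache_pressure = 1" | sudo tee /etc/sysctl.d/90-vfs-cache.conf
  sudo sysctl --system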
>>>>>>
>>>>>> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>>>>>>>
>>>>>>> If I create a new pool it is generally fast for a short amount of time.
>>>>>>> Not as fast as if I had a blank cluster, but close to it.
>>>>>>>
>>>>>>> Bryn
>>>>>>>>
>>>>>>>> On 8 Jul 2015, at 13:55, Gregory Farnum <g...@gregs42.com> wrote:
>>>>>>>>
>>>>>>>> I think you're probably running into the internal PG/collection
>>>>>>>> splitting here; try searching for those terms and seeing what your
>>>>>>>> OSD folder structures look like. You could test by creating a new
>>>>>>>> pool and seeing if it's faster or slower than the one you've already
>>>>>>>> filled up.
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>>>>>>>> <bryn.math...@alcatel-lucent.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I'm perf testing a cluster again.
>>>>>>>>> This time I have re-built the cluster and am filling it for testing.
>>>>>>>>>
>>>>>>>>> On a 10 min run I get the following results from 5 load generators,
>>>>>>>>> each writing through 7 iocontexts, with a queue depth of 50 async
>>>>>>>>> writes.
>>>>>>>>>
>>>>>>>>> Gen1
>>>>>>>>> Percentile 100 = 0.729775905609
>>>>>>>>> Max latencies = 0.729775905609, Min = 0.0320818424225, mean = 0.0750389684542
>>>>>>>>> Total objects written = 113088 in time 604.259738207s gives 187.151307376/s (748.605229503 MB/s)
>>>>>>>>>
>>>>>>>>> Gen2
>>>>>>>>> Percentile 100 = 0.735981941223
>>>>>>>>> Max latencies = 0.735981941223, Min = 0.0340068340302, mean = 0.0745198070711
>>>>>>>>> Total objects written = 113822 in time 604.437897921s gives 188.310495407/s (753.241981627 MB/s)
>>>>>>>>>
>>>>>>>>> Gen3
>>>>>>>>> Percentile 100 = 0.828994989395
>>>>>>>>> Max latencies = 0.828994989395, Min = 0.0349340438843, mean = 0.0745455575197
>>>>>>>>> Total objects written = 113670 in time 604.352181911s gives 188.085694736/s (752.342778944 MB/s)
>>>>>>>>>
>>>>>>>>> Gen4
>>>>>>>>> Percentile 100 = 1.06834602356
>>>>>>>>> Max latencies = 1.06834602356, Min = 0.0333499908447, mean = 0.0752239764659
>>>>>>>>> Total objects written = 112744 in time 604.408732891s gives 186.536020849/s (746.144083397 MB/s)
>>>>>>>>>
>>>>>>>>> Gen5
>>>>>>>>> Percentile 100 = 0.609658002853
>>>>>>>>> Max latencies = 0.609658002853, Min = 0.032968044281, mean = 0.0744482759499
>>>>>>>>> Total objects written = 113918 in time 604.671534061s gives 188.396498897/s (753.585995589 MB/s)
>>>>>>>>>
>>>>>>>>> Example ceph -w output:
>>>>>>>>> 2015-07-07 15:50:16.507084 mon.0 [INF] pgmap v1077: 2880 pgs: 2880 active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB avail; 2185 MB/s wr, 572 op/s
>>>>>>>>>
>>>>>>>>> However, when the cluster gets over 20% full I see the following
>>>>>>>>> results, and this gets worse as the cluster fills up:
>>>>>>>>>
>>>>>>>>> Gen1
>>>>>>>>> Percentile 100 = 6.71176099777
>>>>>>>>> Max latencies = 6.71176099777, Min = 0.0358741283417, mean = 0.161760483485
>>>>>>>>> Total objects written = 52196 in time 604.488474131s gives 86.347386648/s (345.389546592 MB/s)
>>>>>>>>>
>>>>>>>>> Gen2
>>>>>>>>> Max latencies = 4.09169006348, Min = 0.0357890129089, mean = 0.163243938477
>>>>>>>>> Total objects written = 51702 in time 604.036739111s gives 85.5941313704/s (342.376525482 MB/s)
>>>>>>>>>
>>>>>>>>> Gen3
>>>>>>>>> Percentile 100 = 7.32526683807
>>>>>>>>> Max latencies = 7.32526683807, Min = 0.0366668701172, mean = 0.163992217926
>>>>>>>>> Total objects written = 51476 in time 604.684302092s gives 85.1287189397/s (340.514875759 MB/s)
>>>>>>>>>
>>>>>>>>> Gen4
>>>>>>>>> Percentile 100 = 7.56094503403
>>>>>>>>> Max latencies = 7.56094503403, Min = 0.0355761051178, mean = 0.162109421231
>>>>>>>>> Total objects written = 52092 in time 604.769910812s gives 86.1352376642/s (344.540950657 MB/s)
>>>>>>>>>
>>>>>>>>> Gen5
>>>>>>>>> Percentile 100 = 6.99595499039
>>>>>>>>> Max latencies = 6.99595499039, Min = 0.0364680290222, mean = 0.163651215426
>>>>>>>>> Total objects written = 51566 in time 604.061977148s gives 85.3654127404/s (341.461650961 MB/s)
>>>>>>>>>
>>>>>>>>> Cluster details:
>>>>>>>>> 5 x HP DL380 with 13 x 6TB OSDs
>>>>>>>>> 128GB RAM
>>>>>>>>> 2 x Intel 2620v3
>>>>>>>>> 10Gbit Ceph public network
>>>>>>>>> 10Gbit Ceph private network
>>>>>>>>>
>>>>>>>>> Load generators connected via a 20Gbit bond to the ceph public network.
>>>>>>>>>
>>>>>>>>> Is this likely to be something happening to the journals?
>>>>>>>>> Or is there something else going on?
>>>>>>>>>
>>>>>>>>> I have run FIO and iperf tests and the disk and network performance
>>>>>>>>> is very high.
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Bryn Mathias

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com