On vm.vfs_cache_pressure = 1: we had this initially, and I still think it
is the best choice for most configs. However, with our large memory
footprint, vfs_cache_pressure=1 increased the likelihood of hitting an
issue where our write response time would double; dropping the caches
would then return response time to normal. I don't claim to fully
understand this yet and only have speculation at the moment. Thanks again
for the suggestion; I do think it is best for boxes that don't have very
large memory.
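
For anyone who wants to make the setting persistent, the usual sysctl
route should work; a minimal sketch (the file name is arbitrary):

  # /etc/sysctl.d/99-vfs-cache.conf
  # keep dentry/inode cache entries around as long as possible
  vm.vfs_cache_pressure = 1

  # apply immediately without a reboot
  sysctl -w vm.vfs_cache_pressure=1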

@ Christian - reformatting to btrfs or ext4 is an option in my test
cluster. I thought about that, but I needed to sort out XFS first, since
that's what production will run right now. You all have helped me do
that, and thank you again. I will circle back and test btrfs under the
same conditions. I suspect it will behave similarly, but it's only about
a day and a half's work to test.

Best Regards,
Wade


On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <somnath....@sandisk.com> wrote:
> Oops, typo: 128 GB :-)...
>
> -----Original Message-----
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: Thursday, June 23, 2016 5:08 PM
> To: ceph-users@lists.ceph.com
> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph 
> Development
> Subject: Re: [ceph-users] Dramatic performance drop at certain number of 
> objects in pool
>
>
> Hello,
>
> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>
>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>> *pin* inodes/dentries in memory. We have been using that for a long
>> time now (with 128 TB node memory) and it seems to help, especially
>> for the random write workload, by saving the xattr reads in between.
>>
> 128TB node memory, really?
> Can I have some of those, too? ^o^
> And here I was thinking that Wade's 660GB machines were on the excessive side.
>
> There's something to be said (and optimized) when your storage nodes have
> as much RAM as your compute nodes, or more...
>
> As for Warren, well spotted.
> I personally use vm.vfs_cache_pressure = 1, this avoids the potential 
> fireworks if your memory is really needed elsewhere, while keeping things in 
> memory normally.
>
> Christian
>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Warren Wang - ISD
>> Sent: Thursday, June 23, 2016 3:09 PM
>> To: Wade Holler; Blair Bethwaite
>> Cc: Ceph Development; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>> of objects in pool
>>
>> vm.vfs_cache_pressure = 100
>>
>> Go the other direction on that. You'll want to keep it low to help
>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>
>>
>> Warren Wang
>>
>>
>>
>>
>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.hol...@gmail.com> wrote:
>>
>> >Blairo,
>> >
>> >We'll speak in pre-replication numbers; replication for this pool is 3.
>> >
>> >23.3 Million Objects / OSD
>> >pg_num 2048
>> >16 OSDs / Server
>> >3 Servers
>> >660 GB RAM Total, 179 GB Used (free -t) / Server
>> >vm.swappiness = 1
>> >vm.vfs_cache_pressure = 100
>> >
>> >Workload is native librados with python.  ALL 4k objects.
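>> >
>> >For the curious, the writer is essentially the following (a sketch; the
>> >pool name and object count are illustrative, not our actual values):
>> >
>> >import os
>> >import rados
>> >
>> ># connect with the local cluster config and open the test pool
>> >cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>> >cluster.connect()
>> >ioctx = cluster.open_ioctx('testpool')
>> >
>> >payload = os.urandom(4096)  # all 4k objects, as above
>> >for i in range(1000000):
>> >    ioctx.write_full('obj-%d' % i, payload)  # one rados object per key
>> >
>> >ioctx.close()
>> >cluster.shutdown()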
>> >
>> >Best Regards,
>> >Wade
>> >
>> >
>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>> ><blair.bethwa...@gmail.com> wrote:
>> >> Wade, good to know.
>> >>
>> >> For the record, what does this work out to roughly per OSD? And how
>> >> much RAM and how many PGs per OSD do you have?
>> >>
>> >> What's your workload? I wonder whether for certain workloads (e.g.
>> >> RBD) it's better to increase default object size somewhat before
>> >> pushing the split/merge up a lot...
>> >>
>> >> Cheers,
>> >>
>> >> On 23 June 2016 at 11:26, Wade Holler <wade.hol...@gmail.com> wrote:
>> >>> Based on everyone's suggestions: the first modification, to 50 / 16,
>> >>> enabled our config to get to ~645 million objects before the
>> >>> behavior in question was observed (~330 million was the previous
>> >>> ceiling). A subsequent modification to 50 / 24 has enabled us to
>> >>> get to 1.1 billion+.
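>> >>>
>> >>> In ceph.conf terms that change looks roughly like this (a sketch; I'm
>> >>> assuming the first number is the merge threshold and the second the
>> >>> split multiple, under [osd]):
>> >>>
>> >>>   [osd]
>> >>>   filestore merge threshold = 50
>> >>>   filestore split multiple = 24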
>> >>>
>> >>> Thank you all very much for your support and assistance.
>> >>>
>> >>> Best Regards,
>> >>> Wade
>> >>>
>> >>>
>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <ch...@gol.com> wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>> >>>>
>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>> >>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket
>> >>>>> here. One of those things you just have to find out as an operator
>> >>>>> since it's not well documented :(
>> >>>>>
>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>> >>>>>
>> >>>>> We have over 200 million objects in this cluster, and it's still
>> >>>>> doing over 15000 write IOPS all day long with 302 spinning drives
>> >>>>> + SATA SSD journals. Having enough memory and dropping your
>> >>>>> vfs_cache_pressure should also help.
>> >>>>>
>> >>>> Indeed.
>> >>>>
>> >>>> Since it was asked in that bug report and was also my first
>> >>>> suspicion, it is probably a good time to clarify that it isn't the
>> >>>> splits themselves that cause the performance degradation, but the
>> >>>> resulting inflation of dir entries and exhaustion of SLAB, and thus
>> >>>> having to go to disk for things that normally would be in memory.
>> >>>>
>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>> >>>> clear; a purely split-caused degradation should have relented much
>> >>>> quicker.
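>> >>>>
>> >>>> If you want to watch this happen, the dentry and xfs inode slabs
>> >>>> tell the story; a sketch:
>> >>>>
>> >>>>   # one-shot slab snapshot, filtered to the relevant caches
>> >>>>   slabtop -o | egrep 'dentry|xfs_inode'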
>> >>>>
>> >>>>
>> >>>>> Keep in mind that if you change the values, it won't take effect
>> >>>>> immediately. It only merges them back if the directory is under
>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>> >>>>> forget).
>> >>>>>
>> >>>> If it's a read, a plain scrub might do the trick.
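>> >>>> Something like this, per PG, would be the sketch (the PG id is
>> >>>> purely illustrative):
>> >>>>
>> >>>>   ceph pg scrub 3.1f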
>> >>>>
>> >>>> Christian
>> >>>>> Warren
>> >>>>>
>> >>>>>
>> >>>>> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of
>> >>>>> Wade Holler <wade.hol...@gmail.com>
>> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
>> >>>>> To: Blair Bethwaite <blair.bethwa...@gmail.com>, Wido den Hollander <w...@42on.com>
>> >>>>> Cc: Ceph Development <ceph-de...@vger.kernel.org>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>> >>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain
>> >>>>> number of objects in pool
>> >>>>>
>> >>>>> Thanks everyone for your replies. I sincerely appreciate it. We
>> >>>>> are testing with different pg_num and filestore_split_multiple
>> >>>>> settings. Early indications are... well, not great. Regardless,
>> >>>>> it is nice to understand the symptoms better so we can try to
>> >>>>> design around it.
>> >>>>>
>> >>>>> Best Regards,
>> >>>>> Wade
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>> >>>>> <blair.bethwa...@gmail.com> wrote:
>> >>>>> On 20 June 2016 at 09:21, Blair Bethwaite
>> >>>>> <blair.bethwa...@gmail.com> wrote:
>> >>>>> > slow request issues). If you watch your xfs stats you'll likely
>> >>>>> > get further confirmation. In my experience xs_dir_lookups
>> >>>>> > balloons (which means directory lookups are missing cache and
>> >>>>> > going to disk).
>> >>>>>
>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer
>> >>>>> in preparation for Jewel/RHCS2. Turns out when we last hit this
>> >>>>> very problem we had only ephemerally set the new filestore
>> >>>>> merge/split values - oops. Here's what started happening when we
>> >>>>> upgraded and restarted a bunch of OSDs:
>> >>>>>
>> >>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>> >>>>>
>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>> >>>>> 12:30, but it still took a while to settle.
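>> >>>>>
>> >>>>> For anyone wanting to watch the same counter without a dashboard,
>> >>>>> the raw numbers live in /proc/fs/xfs/stat (a sketch; the first
>> >>>>> field on the dir line should be the lookup count):
>> >>>>>
>> >>>>>   grep '^dir ' /proc/fs/xfs/stat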
>> >>>>>
>> >>>>> --
>> >>>>> Cheers,
>> >>>>> ~Blairo
>> >>>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Christian Balzer        Network/Systems Engineer
>> >>>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> >>>> http://www.gol.com/
>> >>
>> >>
>> >>
>> >> --
>> >> Cheers,
>> >> ~Blairo
>>
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/