On vm.vfs_cache_pressure = 1: we had this initially and I still think it is the best choice for most configs. However, with our large memory footprint, vfs_cache_pressure=1 increased the likelihood of hitting an issue where our write response time would double; a drop of caches would then return response time to normal. I don't claim to fully understand this yet and only have speculation at the moment. Thanks again for the suggestion; I do think it is the right setting for boxes that don't have very large memory.
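For anyone pulling this thread out of the archives later, here is a minimal sketch of the VM knobs being discussed. The values are just the ones mentioned in this thread (not a recommendation), and the sysctl.d file name is my own example:

    # /etc/sysctl.d/90-ceph-osd.conf  (example file name; persists across reboots)
    vm.swappiness = 1
    # kernel default is 100; 1 strongly prefers keeping dentries/inodes cached but
    # still lets the kernel reclaim them; 0 effectively pins them and can bite you
    # if that memory is ever needed elsewhere
    vm.vfs_cache_pressure = 1

    # apply without a reboot
    sysctl -p /etc/sysctl.d/90-ceph-osd.conf

    # the "drop of caches" I mentioned above; 2 frees reclaimable dentries/inodes
    # (3 would also drop the page cache)
    sync && echo 2 > /proc/sys/vm/drop_caches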
@ Christian - reformatting to btrfs or ext4 is an option in my test cluster. I thought about that but needed to sort xfs first (that's what production will run right now). You all have helped me do that, and thank you again. I will circle back and test btrfs under the same conditions. I suspect it will behave similarly, but it's only a day and a half's work or so to test. (For anyone searching the archives later, I've put a rough sketch of the split/merge settings we're testing at the very bottom of this mail.)

Best Regards,
Wade

On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <somnath....@sandisk.com> wrote:
> Oops , typo , 128 GB :-)...
>
> -----Original Message-----
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: Thursday, June 23, 2016 5:08 PM
> To: ceph-users@lists.ceph.com
> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph Development
> Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
>
> Hello,
>
> On Thu, 23 Jun 2016 22:24:59 +0000 Somnath Roy wrote:
>
>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>> *pin* inode/dentries in memory. We are using that for long now (with
>> 128 TB node memory) and it seems helping specially for the random
>> write workload and saving xattrs read in between.
>>
> 128TB node memory, really?
> Can I have some of those, too? ^o^
> And here I was thinking that Wade's 660GB machines were on the excessive side.
>
> There's something to be said (and optimized) when your storage nodes have the
> same or more RAM as your compute nodes...
>
> As for Warren, well spotted.
> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
> fireworks if your memory is really needed elsewhere, while keeping things in
> memory normally.
>
> Christian
>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> Of Warren Wang - ISD
>> Sent: Thursday, June 23, 2016 3:09 PM
>> To: Wade Holler; Blair Bethwaite
>> Cc: Ceph Development; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>> of objects in pool
>>
>> vm.vfs_cache_pressure = 100
>>
>> Go the other direction on that. You'll want to keep it low to help
>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>
>>
>> Warren Wang
>>
>>
>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.hol...@gmail.com> wrote:
>>
>> >Blairo,
>> >
>> >We'll speak in pre-replication numbers, replication for this pool is 3.
>> >
>> >23.3 Million Objects / OSD
>> >pg_num 2048
>> >16 OSDs / Server
>> >3 Servers
>> >660 GB RAM Total, 179 GB Used (free -t) / Server
>> >vm.swappiness = 1
>> >vm.vfs_cache_pressure = 100
>> >
>> >Workload is native librados with python. ALL 4k objects.
>> >
>> >Best Regards,
>> >Wade
>> >
>> >
>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>> ><blair.bethwa...@gmail.com> wrote:
>> >> Wade, good to know.
>> >>
>> >> For the record, what does this work out to roughly per OSD? And how
>> >> much RAM and how many PGs per OSD do you have?
>> >>
>> >> What's your workload? I wonder whether for certain workloads (e.g.
>> >> RBD) it's better to increase default object size somewhat before
>> >> pushing the split/merge up a lot...
>> >>
>> >> Cheers,
>> >>
>> >> On 23 June 2016 at 11:26, Wade Holler <wade.hol...@gmail.com> wrote:
>> >>> Based on everyone's suggestions, the first modification to 50 / 16
>> >>> enabled our config to get to ~645 million objects before the behavior
>> >>> in question was observed (~330 was the previous ceiling).
>> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
>> >>> billion+.
>> >>>
>> >>> Thank you all very much for your support and assistance.
>> >>>
>> >>> Best Regards,
>> >>> Wade
>> >>>
>> >>>
>> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <ch...@gol.com> wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> On Mon, 20 Jun 2016 20:47:32 +0000 Warren Wang - ISD wrote:
>> >>>>
>> >>>>> Sorry, late to the party here. I agree, up the merge and split
>> >>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>> >>>>> One of those things you just have to find out as an operator
>> >>>>> since it's not well documented :(
>> >>>>>
>> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>> >>>>>
>> >>>>> We have over 200 million objects in this cluster, and it's still
>> >>>>> doing over 15000 write IOPS all day long with 302 spinning drives
>> >>>>> + SATA SSD journals. Having enough memory and dropping your
>> >>>>> vfs_cache_pressure should also help.
>> >>>>>
>> >>>> Indeed.
>> >>>>
>> >>>> Since it was asked in that bug report and also my first suspicion,
>> >>>> it would probably be a good time to clarify that it isn't the splits
>> >>>> that cause the performance degradation, but the resulting inflation
>> >>>> of dir entries and exhaustion of SLAB and thus having to go to disk
>> >>>> for things that normally would be in memory.
>> >>>>
>> >>>> Looking at Blair's graph from yesterday pretty much makes that
>> >>>> clear, a purely split caused degradation should have relented
>> >>>> much quicker.
>> >>>>
>> >>>>
>> >>>>> Keep in mind that if you change the values, it won't take effect
>> >>>>> immediately. It only merges them back if the directory is under
>> >>>>> the calculated threshold and a write occurs (maybe a read, I
>> >>>>> forget).
>> >>>>>
>> >>>> If it's a read a plain scrub might do the trick.
>> >>>>
>> >>>> Christian
>> >>>>
>> >>>>> Warren
>> >>>>>
>> >>>>>
>> >>>>> From: ceph-users <ceph-users-boun...@lists.ceph.com>
>> >>>>> on behalf of Wade Holler <wade.hol...@gmail.com>
>> >>>>> Date: Monday, June 20, 2016 at 2:48 PM
>> >>>>> To: Blair Bethwaite <blair.bethwa...@gmail.com>, Wido den Hollander <w...@42on.com>
>> >>>>> Cc: Ceph Development <ceph-de...@vger.kernel.org>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
>> >>>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>> >>>>> of objects in pool
>> >>>>>
>> >>>>> Thanks everyone for your replies. I sincerely appreciate it. We
>> >>>>> are testing with different pg_num and filestore_split_multiple
>> >>>>> settings. Early indications are .... well not great. Regardless,
>> >>>>> it is nice to understand the symptoms better so we try to design
>> >>>>> around it.
>> >>>>>
>> >>>>> Best Regards,
>> >>>>> Wade
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>> >>>>> <blair.bethwa...@gmail.com> wrote:
>> >>>>> On 20 June 2016 at 09:21, Blair Bethwaite
>> >>>>> <blair.bethwa...@gmail.com> wrote:
>> >>>>> > slow request issues). If you watch your xfs stats you'll
>> >>>>> > likely get further confirmation.
>> >>>>> > In my experience xs_dir_lookups balloons (which
>> >>>>> > means directory lookups are missing cache and going to disk).
>> >>>>>
>> >>>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>> >>>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>> >>>>> problem we had only ephemerally set the new filestore merge/split
>> >>>>> values - oops. Here's what started happening when we upgraded and
>> >>>>> restarted a bunch of OSDs:
>> >>>>>
>> >>>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>> >>>>>
>> >>>>> Seemed to cause lots of slow requests :-/. We corrected it about
>> >>>>> 12:30, then still took a while to settle.
>> >>>>>
>> >>>>> --
>> >>>>> Cheers,
>> >>>>> ~Blairo
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Christian Balzer        Network/Systems Engineer
>> >>>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> >>>> http://www.gol.com/
>> >>
>> >>
>> >> --
>> >> Cheers,
>> >> ~Blairo
>>
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
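PS: since the "50 / 16" and "50 / 24" shorthand above is easy to misread, here is a rough sketch of what we are actually setting. My reading is that the first number is filestore_merge_threshold and the second is filestore_split_multiple; double-check that mapping and the arithmetic against your Ceph version before copying anything:

    # ceph.conf on the OSD nodes (FileStore-era options; defaults were 10 and 2)
    [osd]
    filestore merge threshold = 50
    filestore split multiple = 24

    # A FileStore PG subdirectory splits once it holds more than roughly
    #   filestore_split_multiple * abs(filestore_merge_threshold) * 16  objects,
    # i.e. 24 * 50 * 16 = 19200 objects per directory with the values above,
    # versus 2 * 10 * 16 = 320 with the defaults.

As Warren notes above, changing these values does not re-shuffle existing directories immediately; merges only happen lazily when a directory falls under the calculated threshold and is touched again.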