> filestore_fd_cache_random = true

not true
Shinobu

On Fri, Aug 21, 2015 at 10:20 PM, Jan Schermer <j...@schermer.cz> wrote:
> Thanks for the config,
> a few comments inline, not really related to the issue
>
> > On 21 Aug 2015, at 15:12, J-P Methot <jpmet...@gtcomm.net> wrote:
> >
> > Hi,
> >
> > First of all, we are sure that the return to the default configuration
> > fixed it. As soon as we restarted only one of the ceph nodes with the
> > default configuration, it sped up recovery tremendously. We had already
> > restarted before with the old conf and recovery was never that fast.
> >
> > Regarding the configuration, here's the old one with comments:
> >
> > [global]
> > fsid = *************************
> > mon_initial_members = cephmon1
> > mon_host = *******************
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true    // lets you use xattributes of xfs/ext4/btrfs filesystems
>
> This actually did the opposite, but this option doesn't exist anymore.
>
> > osd_pool_default_pgp_num = 450     // default pgp number for new pools
> > osd_pg_bits = 12                   // number of bits used to designate pgs; lets you have 2^12 pgs
>
> Could someone comment on those? What exactly do they do? What if I have
> more PGs than num_osds * osd_pg_bits?
>
> > osd_pool_default_size = 3          // default copy number for new pools
> > osd_pool_default_pg_num = 450      // default pg number for new pools
> > public_network = *************
> > cluster_network = ***************
> > osd_pgp_bits = 12                  // number of bits used to designate pgps; lets you have 2^12 pgps
> >
> > [osd]
> > filestore_queue_max_ops = 5000     // set to 500 by default. Defines the maximum number of in-progress
> >                                    // operations the file store accepts before blocking on queuing new operations.
> > filestore_fd_cache_random = true   // ????
>
> No docs, I don't see this in my ancient cluster :-)
>
> > journal_queue_max_ops = 1000000    // set to 500 by default. Number of operations allowed in the journal queue.
> > filestore_omap_header_cache_size = 1000000   // Determines the size of the LRU used to cache object omap headers.
> >                                              // Larger values use more memory but may reduce lookups on omap.
> > filestore_fd_cache_size = 1000000  // not in the ceph documentation; seems to be a common tweak for SSD clusters though.
>
> You don't really need to set this so high, but I'm not sure what the
> implications are if you go too high (it probably doesn't eat more memory
> until it opens that many files). If you have 4MB objects on a 1TB drive,
> then you really only need about 250K to keep all files open.
>
> > max_open_files = 1000000           // lets ceph set the max file descriptor limit in the OS to prevent
> >                                    // running out of file descriptors
>
> This is too low if you were really using all of the fd_cache. There are
> going to be thousands of TCP connections which need to be accounted for as
> well (in my experience there can be hundreds to thousands of TCP connections
> from just one RBD client and 200 OSDs, which is a lot).
>
> > osd_journal_size = 10000           // journal max size for each OSD
> >
> > New conf:
> >
> > [global]
> > fsid = *************************
> > mon_initial_members = cephmon1
> > mon_host = ************
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > public_network = ******************
> > cluster_network = ******************
> >
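Jan's sizing estimate above is easy to verify, and the fd cache versus max_open_files question can be checked against a live OSD. A minimal sketch, run as root on an OSD node; it assumes the standard Linux /proc layout and simply picks the first ceph-osd process it finds:

    # ~262,000 4 MB objects fit on a 1 TB drive, so a 1,000,000-entry fd cache
    # can never be filled by object files alone
    echo $(( (1024*1024*1024*1024) / (4*1024*1024) ))

    # Open file descriptors of the first ceph-osd process on this node
    # (object files, sockets, journal and leveldb all count)
    PID=$(pidof ceph-osd | awk '{print $1}')
    ls /proc/$PID/fd | wc -l

    # The limit this daemon is actually running with (max_open_files)
    grep 'Max open files' /proc/$PID/limits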
> > You might notice I have a few undocumented settings in the old
> > configuration. These are settings I took from a certain openstack summit
> > presentation and they may have contributed to this whole problem. Here's
> > a list of settings that I think might be a possible cause for these
> > speed issues:
> >
> > filestore_fd_cache_random = true
> > filestore_fd_cache_size = 1000000
> >
> > Additionally, my colleague thinks these settings may have contributed:
> >
> > filestore_queue_max_ops = 5000
> > journal_queue_max_ops = 1000000
> >
> > We will do further tests on these settings once we have our lab ceph
> > test environment, as we are also curious as to exactly what caused this.
> >
> >
> > On 2015-08-20 11:43 AM, Alex Gorbachev wrote:
> >>>
> >>> Just to update the mailing list, we ended up going back to the default
> >>> ceph.conf without any additional settings beyond what is mandatory. We are
> >>> now reaching speeds we never reached before, both in recovery and in
> >>> regular usage. There was definitely something we set in the ceph.conf
> >>> bogging everything down.
> >>
> >> Could you please share the old and new ceph.conf, or the section that
> >> was removed?
> >>
> >> Best regards,
> >> Alex
> >>
> >>>
> >>> On 2015-08-20 4:06 AM, Christian Balzer wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Of all the pertinent points by Somnath, the one about pre-conditioning
> >>>> would be pretty high on my list, especially if this slowness persists and
> >>>> nothing else (scrub) is going on.
> >>>>
> >>>> This might be "fixed" by doing an fstrim.
> >>>>
> >>>> Additionally, the levelDBs per OSD are of course syncing heavily during
> >>>> reconstruction, so that might not be the favorite thing for your type of
> >>>> SSDs.
> >>>>
> >>>> But ultimately situational awareness is very important, as in "what" is
> >>>> actually going on and slowing things down.
> >>>> As usual my recommendation would be to use atop, iostat or similar on all
> >>>> your nodes and see if your OSD SSDs are indeed the bottleneck, or if it is
> >>>> maybe just one of them or something else entirely.
> >>>>
> >>>> Christian
> >>>>
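Christian's two suggestions are both quick to act on. A minimal sketch, assuming FileStore OSDs mounted at the default /var/lib/ceph/osd/ceph-<id> path (osd.0 here is just an example), SSDs that support TRIM, and the sysstat package for iostat:

    # Release unused blocks back to the SSD; repeat per OSD, one at a time,
    # since a large trim can itself briefly stall the drive
    fstrim -v /var/lib/ceph/osd/ceph-0

    # Watch per-device utilisation and latency while recovery runs, refreshing
    # every second, to see whether the OSD SSDs really are the bottleneck
    iostat -x 1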
> >>>> On Wed, 19 Aug 2015 20:54:11 +0000 Somnath Roy wrote:
> >>>>
> >>>>> Also, check if scrubbing started in the cluster or not. That may
> >>>>> considerably slow down the cluster.
> >>>>>
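Somnath's scrub check, and pausing scrubs for the duration of the recovery, only takes a few commands from any node with an admin keyring; a minimal sketch (the flags are cluster-wide, so remember to unset them afterwards):

    # Any PGs currently scrubbing show up in the status and pg dump output
    ceph status
    ceph pg dump | grep scrubbing

    # Pause new scrubs while the cluster recovers
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # ...and re-enable them once recovery has finished
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub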
> >>>>> -----Original Message-----
> >>>>> From: Somnath Roy
> >>>>> Sent: Wednesday, August 19, 2015 1:35 PM
> >>>>> To: 'J-P Methot'; ceph-us...@ceph.com
> >>>>> Subject: RE: [ceph-users] Bad performances in recovery
> >>>>>
> >>>>> All the writes will go through the journal.
> >>>>> It may happen that your SSDs are not preconditioned well and, after a lot of
> >>>>> writes during recovery, IOs stabilize at a lower number. This is quite
> >>>>> common for SSDs if that is the case.
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: J-P Methot [mailto:jpmet...@gtcomm.net]
> >>>>> Sent: Wednesday, August 19, 2015 1:03 PM
> >>>>> To: Somnath Roy; ceph-us...@ceph.com
> >>>>> Subject: Re: [ceph-users] Bad performances in recovery
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Thank you for the quick reply. However, we do have those exact settings
> >>>>> for recovery and it still strongly affects client io. I have looked at
> >>>>> various ceph logs and osd logs and nothing is out of the ordinary.
> >>>>> Here's an idea though, please tell me if I am wrong.
> >>>>>
> >>>>> We use Intel SSDs for journaling and Samsung SSDs as proper OSDs. As was
> >>>>> explained several times on this mailing list, Samsung SSDs suck in ceph.
> >>>>> They have horrible O_dsync speed and die easily when used as journals.
> >>>>> That's why we're using Intel SSDs for journaling, so that we didn't end
> >>>>> up putting 96 Samsung SSDs in the trash.
> >>>>>
> >>>>> In recovery though, what is the ceph behaviour? What kind of write does
> >>>>> it do on the OSD SSDs? Does it write directly to the SSDs or through the
> >>>>> journal?
> >>>>>
> >>>>> Additionally, something else we notice: the ceph cluster is MUCH slower
> >>>>> after recovery than before. Clearly there is a bottleneck somewhere and
> >>>>> that bottleneck does not get cleared up after the recovery is done.
> >>>>>
> >>>>>
> >>>>> On 2015-08-19 3:32 PM, Somnath Roy wrote:
> >>>>>> If you are concerned about *client io performance* during recovery,
> >>>>>> use these settings..
> >>>>>>
> >>>>>> osd recovery max active = 1
> >>>>>> osd max backfills = 1
> >>>>>> osd recovery threads = 1
> >>>>>> osd recovery op priority = 1
> >>>>>>
> >>>>>> If you are concerned about *recovery performance*, you may want to
> >>>>>> bump these up, but I doubt it will help much from the default settings..
> >>>>>>
> >>>>>> Thanks & Regards
> >>>>>> Somnath
> >>>>>>
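Somnath's throttles go in the [osd] section of ceph.conf, but they can also be injected into running OSDs without a restart; a minimal sketch of the runtime variant (injected values do not survive an OSD restart, osd recovery threads is left at its ceph.conf value here, and osd.0 is just an example id):

    # Throttle recovery and backfill on all running OSDs in favour of client IO
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # Confirm what one OSD is now running with (run on the node hosting osd.0)
    ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'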
> >>>>>> -----Original Message-----
> >>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >>>>>> Of J-P Methot
> >>>>>> Sent: Wednesday, August 19, 2015 12:17 PM
> >>>>>> To: ceph-us...@ceph.com
> >>>>>> Subject: [ceph-users] Bad performances in recovery
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Our setup is currently comprised of 5 OSD nodes with 12 OSDs each, for
> >>>>>> a total of 60 OSDs. All of these are SSDs, with 4 SSD journals on each.
> >>>>>> The ceph version is hammer v0.94.1. There is a performance overhead
> >>>>>> because we're using SSDs (I've heard it gets better in infernalis, but
> >>>>>> we're not upgrading just yet), but we can reach numbers that I would
> >>>>>> consider "alright".
> >>>>>>
> >>>>>> Now, the issue is, when the cluster goes into recovery it's very fast
> >>>>>> at first, but then slows down to ridiculous levels as it moves
> >>>>>> forward. You can go from 7% to 2% left to recover in ten minutes, but it
> >>>>>> may take 2 hours to recover the last 2%. While this happens, the
> >>>>>> attached openstack setup becomes incredibly slow, even though there is
> >>>>>> only a small fraction of objects still recovering (less than 1%). The
> >>>>>> settings that may affect recovery speed are very low, as they are by
> >>>>>> default, yet they still affect client io speed way more than they should.
> >>>>>>
> >>>>>> Why would ceph recovery become so slow as it progresses and affect
> >>>>>> client io even though it's recovering at a snail's pace? And by a
> >>>>>> snail's pace, I mean a few kb/second on 10gbps uplinks.
> >
> > --
> > ======================
> > Jean-Philippe Méthot
> > Administrateur système / System administrator
> > GloboTech Communications
> > Phone: 1-514-907-0050
> > Toll Free: 1-(888)-GTCOMM1
> > Fax: 1-(514)-907-0750
> > jpmet...@gtcomm.net
> > http://www.gtcomm.net

--
Email: shin...@linux.com
       ski...@redhat.com
Life w/ Linux <http://i-shinobu.hatenablog.com/>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com