> filestore_fd_cache_random = true

not true
Shinobu

On Fri, Aug 21, 2015 at 10:20 PM, Jan Schermer <j...@schermer.cz> wrote:
> Thanks for the config,
> a few comments inline, not really related to the issue
>
> > On 21 Aug 2015, at 15:12, J-P Methot <jpmet...@gtcomm.net> wrote:
> >
> > Hi,
> >
> > First of all, we are sure that the return to the default configuration
> > fixed it. As soon as we restarted only one of the ceph nodes with the
> > default configuration, it sped up recovery tremendously. We had already
> > restarted before with the old conf and recovery was never that fast.
> >
> > Regarding the configuration, here's the old one with comments:
> >
> > [global]
> > fsid = *************************
> > mon_initial_members = cephmon1
> > mon_host = *******************
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true    // lets you use xattributes of xfs/ext4/btrfs filesystems
>
> This actually did the opposite, but this option doesn't exist anymore.
>
> > osd_pool_default_pgp_num = 450     // default pgp number for new pools
> > osd_pg_bits = 12                   // number of bits used to designate pgs; lets you have 2^12 pgs
>
> Could someone comment on those? What exactly do they do? What if I have
> more PGs than num_osds * osd_pg_bits?
>
> > osd_pool_default_size = 3          // default copy number for new pools
> > osd_pool_default_pg_num = 450      // default pg number for new pools
> > public_network = *************
> > cluster_network = ***************
> > osd_pgp_bits = 12                  // number of bits used to designate pgps; lets you have 2^12 pgps
> >
> > [osd]
> > filestore_queue_max_ops = 5000     // set to 500 by default. Defines the maximum number of in-progress
> >                                    // operations the file store accepts before blocking on queuing new operations.
> > filestore_fd_cache_random = true   // ????
>
> No docs, I don't see this in my ancient cluster :-)
>
> > journal_queue_max_ops = 1000000    // set to 500 by default. Number of operations allowed in the journal queue.
> > filestore_omap_header_cache_size = 1000000   // Determines the size of the LRU used to cache object omap headers.
> >                                              // Larger values use more memory but may reduce lookups on omap.
> > filestore_fd_cache_size = 1000000  // not in the ceph documentation; seems to be a common tweak for SSD clusters though.
>
> You don't really need to set this so high, but I'm not sure what the
> implications are if you go too high (it probably doesn't eat more memory
> until it opens that many files). If you have 4MB objects on a 1TB drive,
> then you really only need about 250K to keep all files open.
>
> > max_open_files = 1000000           // lets ceph set the max file descriptor limit in the OS to prevent
> >                                    // running out of file descriptors
>
> This is too low if you were really using all of the fd_cache. There are
> going to be thousands of TCP connections which need to be accounted for as
> well (in my experience there can be hundreds to thousands of TCP connections
> from just one RBD client and 200 OSDs, which is a lot).
>
> > osd_journal_size = 10000           // journal max size for each OSD
> >
> > New conf:
> >
> > [global]
> > fsid = *************************
> > mon_initial_members = cephmon1
> > mon_host = ************
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > public_network = ******************
> > cluster_network = ******************
> >
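Jan's sizing estimate above is easy to verify, and the fd cache versus max_open_files question can be checked against a live OSD. A minimal sketch, run as root on an OSD node; it assumes the standard Linux /proc layout and simply picks the first ceph-osd process it finds:

    # ~262,000 4 MB objects fit on a 1 TB drive, so a 1,000,000-entry fd cache
    # can never be filled by object files alone
    echo $(( (1024*1024*1024*1024) / (4*1024*1024) ))

    # Open file descriptors of the first ceph-osd process on this node
    # (object files, sockets, journal and leveldb all count)
    PID=$(pidof ceph-osd | awk '{print $1}')
    ls /proc/$PID/fd | wc -l

    # The limit this daemon is actually running with (max_open_files)
    grep 'Max open files' /proc/$PID/limits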
> > You might notice I have a few undocumented settings in the old
> > configuration. These are settings I took from a certain openstack summit
> > presentation and they may have contributed to this whole problem. Here's
> > a list of settings that I think might be a possible cause for these
> > speed issues:
> >
> > filestore_fd_cache_random = true
> > filestore_fd_cache_size = 1000000
> >
> > Additionally, my colleague thinks these settings may have contributed:
> >
> > filestore_queue_max_ops = 5000
> > journal_queue_max_ops = 1000000
> >
> > We will do further tests on these settings once we have our lab ceph
> > test environment, as we are also curious as to exactly what caused this.
> >
> >
> > On 2015-08-20 11:43 AM, Alex Gorbachev wrote:
> >>>
> >>> Just to update the mailing list, we ended up going back to the default
> >>> ceph.conf without any additional settings beyond what is mandatory. We are
> >>> now reaching speeds we never reached before, both in recovery and in
> >>> regular usage. There was definitely something we set in the ceph.conf
> >>> bogging everything down.
> >>
> >> Could you please share the old and new ceph.conf, or the section that
> >> was removed?
> >>
> >> Best regards,
> >> Alex
> >>
> >>>
> >>> On 2015-08-20 4:06 AM, Christian Balzer wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> Of all the pertinent points by Somnath, the one about pre-conditioning
> >>>> would be pretty high on my list, especially if this slowness persists and
> >>>> nothing else (scrub) is going on.
> >>>>
> >>>> This might be "fixed" by doing an fstrim.
> >>>>
> >>>> Additionally, the levelDBs per OSD are of course syncing heavily during
> >>>> reconstruction, so that might not be the favorite thing for your type of
> >>>> SSDs.
> >>>>
> >>>> But ultimately situational awareness is very important, as in "what" is
> >>>> actually going on and slowing things down.
> >>>> As usual my recommendation would be to use atop, iostat or similar on all
> >>>> your nodes and see if your OSD SSDs are indeed the bottleneck, or if it is
> >>>> maybe just one of them or something else entirely.
> >>>>
> >>>> Christian
> >>>>
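Christian's two suggestions are both quick to act on. A minimal sketch, assuming FileStore OSDs mounted at the default /var/lib/ceph/osd/ceph-<id> path (osd.0 here is just an example), SSDs that support TRIM, and the sysstat package for iostat:

    # Release unused blocks back to the SSD; repeat per OSD, one at a time,
    # since a large trim can itself briefly stall the drive
    fstrim -v /var/lib/ceph/osd/ceph-0

    # Watch per-device utilisation and latency while recovery runs, refreshing
    # every second, to see whether the OSD SSDs really are the bottleneck
    iostat -x 1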
> >>>> On Wed, 19 Aug 2015 20:54:11 +0000 Somnath Roy wrote:
> >>>>
> >>>>> Also, check if scrubbing started in the cluster or not. That may
> >>>>> considerably slow down the cluster.
> >>>>>
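Somnath's scrub check, and pausing scrubs for the duration of the recovery, only takes a few commands from any node with an admin keyring; a minimal sketch (the flags are cluster-wide, so remember to unset them afterwards):

    # Any PGs currently scrubbing show up in the status and pg dump output
    ceph status
    ceph pg dump | grep scrubbing

    # Pause new scrubs while the cluster recovers
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # ...and re-enable them once recovery has finished
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub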
> >>>>> -----Original Message-----
> >>>>> From: Somnath Roy
> >>>>> Sent: Wednesday, August 19, 2015 1:35 PM
> >>>>> To: 'J-P Methot'; ceph-us...@ceph.com
> >>>>> Subject: RE: [ceph-users] Bad performances in recovery
> >>>>>
> >>>>> All the writes will go through the journal.
> >>>>> It may happen that your SSDs are not preconditioned well and, after a lot of
> >>>>> writes during recovery, IOs stabilize at a lower number. This is quite
> >>>>> common for SSDs if that is the case.
> >>>>>
> >>>>> Thanks & Regards
> >>>>> Somnath
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: J-P Methot [mailto:jpmet...@gtcomm.net]
> >>>>> Sent: Wednesday, August 19, 2015 1:03 PM
> >>>>> To: Somnath Roy; ceph-us...@ceph.com
> >>>>> Subject: Re: [ceph-users] Bad performances in recovery
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Thank you for the quick reply. However, we do have those exact settings
> >>>>> for recovery and it still strongly affects client io. I have looked at
> >>>>> various ceph logs and osd logs and nothing is out of the ordinary.
> >>>>> Here's an idea though, please tell me if I am wrong.
> >>>>>
> >>>>> We use Intel SSDs for journaling and Samsung SSDs as proper OSDs. As was
> >>>>> explained several times on this mailing list, Samsung SSDs suck in ceph.
> >>>>> They have horrible O_dsync speed and die easily when used as journals.
> >>>>> That's why we're using Intel SSDs for journaling, so that we didn't end
> >>>>> up putting 96 Samsung SSDs in the trash.
> >>>>>
> >>>>> In recovery though, what is the ceph behaviour? What kind of write does
> >>>>> it do on the OSD SSDs? Does it write directly to the SSDs or through the
> >>>>> journal?
> >>>>>
> >>>>> Additionally, something else we notice: the ceph cluster is MUCH slower
> >>>>> after recovery than before. Clearly there is a bottleneck somewhere and
> >>>>> that bottleneck does not get cleared up after the recovery is done.
> >>>>>
> >>>>>
> >>>>> On 2015-08-19 3:32 PM, Somnath Roy wrote:
> >>>>>> If you are concerned about *client io performance* during recovery,
> >>>>>> use these settings..
> >>>>>>
> >>>>>> osd recovery max active = 1
> >>>>>> osd max backfills = 1
> >>>>>> osd recovery threads = 1
> >>>>>> osd recovery op priority = 1
> >>>>>>
> >>>>>> If you are concerned about *recovery performance*, you may want to
> >>>>>> bump these up, but I doubt it will help much from the default settings..
> >>>>>>
> >>>>>> Thanks & Regards
> >>>>>> Somnath
> >>>>>>
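Somnath's throttles go in the [osd] section of ceph.conf, but they can also be injected into running OSDs without a restart; a minimal sketch of the runtime variant (injected values do not survive an OSD restart, osd recovery threads is left at its ceph.conf value here, and osd.0 is just an example id):

    # Throttle recovery and backfill on all running OSDs in favour of client IO
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # Confirm what one OSD is now running with (run on the node hosting osd.0)
    ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'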
> >>>>>> -----Original Message-----
> >>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >>>>>> Of J-P Methot
> >>>>>> Sent: Wednesday, August 19, 2015 12:17 PM
> >>>>>> To: ceph-us...@ceph.com
> >>>>>> Subject: [ceph-users] Bad performances in recovery
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Our setup is currently comprised of 5 OSD nodes with 12 OSDs each, for
> >>>>>> a total of 60 OSDs. All of these are SSDs, with 4 SSD journals on each.
> >>>>>> The ceph version is hammer v0.94.1. There is a performance overhead
> >>>>>> because we're using SSDs (I've heard it gets better in infernalis, but
> >>>>>> we're not upgrading just yet), but we can reach numbers that I would
> >>>>>> consider "alright".
> >>>>>>
> >>>>>> Now, the issue is, when the cluster goes into recovery it's very fast
> >>>>>> at first, but then slows down to ridiculous levels as it moves
> >>>>>> forward. You can go from 7% to 2% left to recover in ten minutes, but it
> >>>>>> may take 2 hours to recover the last 2%. While this happens, the
> >>>>>> attached openstack setup becomes incredibly slow, even though there is
> >>>>>> only a small fraction of objects still recovering (less than 1%). The
> >>>>>> settings that may affect recovery speed are very low, as they are by
> >>>>>> default, yet they still affect client io speed way more than they should.
> >>>>>>
> >>>>>> Why would ceph recovery become so slow as it progresses and affect
> >>>>>> client io even though it's recovering at a snail's pace? And by a
> >>>>>> snail's pace, I mean a few kb/second on 10gbps uplinks.
> >
> > --
> > ======================
> > Jean-Philippe Méthot
> > Administrateur système / System administrator
> > GloboTech Communications
> > Phone: 1-514-907-0050
> > Toll Free: 1-(888)-GTCOMM1
> > Fax: 1-(514)-907-0750
> > jpmet...@gtcomm.net
> > http://www.gtcomm.net

--
Email: shin...@linux.com
       ski...@redhat.com
Life w/ Linux <http://i-shinobu.hatenablog.com/>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com