On Wed, Feb 17, 2016 at 12:13 AM, Christian Balzer <ch...@gol.com> wrote:
> Hello,
>
> On Tue, 16 Feb 2016 10:46:32 -0800 Cullen King wrote:
>
> > Thanks for the helpful commentary, Christian. The cluster is performing much better with 50% more spindles (12 to 18 drives), along with setting scrub sleep to 0.1. Didn't see really any gain from moving from the Samsung 850 Pro journal drives to the Intel 3710s, even though dd and other direct tests of the drives yielded much better results. rados bench with 4k requests is still awfully low. I'll figure that problem out next.
>
> Got examples, numbers, watched things with atop?
> 4KB rados benches are what can make my CPUs melt on the cluster here that's most similar to yours. ^o^
>
> > I ended up bumping the number of placement groups from 512 to 1024, which should help a little bit. Basically it'll change the worst-case scrub performance so that it is distributed a little more evenly across drives rather than clustered on a single drive for longer.
>
> Of course with osd_max_scrubs at its default of 1 there should never be more than one scrub per OSD.
> However, I seem to vaguely remember that this is per "primary" scrub, so in the case of deep-scrubs there could still be plenty of contention going on.
> Again, I've always had good success with that manually kicked-off scrub of all OSDs. It seems to sequence things nicely and finishes within 4 hours on my "good" production cluster.
>
> > I think the real solution here is to create a secondary SSD pool, pin some radosgw buckets to it and put my thumbnail data on the smaller, faster pool. I'll reserve the spindle-based pool for the original high-res photos, which are only read to create thumbnails when necessary. This should put the majority of my random read IO on SSDs, and thumbnails average 50kb each so it shouldn't be too spendy. I am considering trying the newer Samsung SM863 drives since we are read heavy, and any potential data loss on this thumbnail pool would not be catastrophic.
>
> I seriously detest it when makers don't put their endurance data on the web page with all the other specifications and make you look things up in a slightly hidden PDF.
> Then they give you the total endurance and make you calculate drive writes per day yourself. ^o^
> Only to find that these have 3 DWPD, which is nothing to be ashamed of and should be fine for this particular use case.
>
> However, take a look at this old posting of mine:
>
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>
> With that in mind, I'd recommend you do some testing with real-world data before you invest too much into something that will wear out long before it has paid for itself.

We are not write heavy at all; if my current drives are any indication, I'd only do one drive write per year on the things.

> Christian
>
> > Third, it seems that I am also running into the known "Lots Of Small Files" performance issue. Looks like performance in my use case will be drastically improved with the upcoming bluestore, though migrating to it sounds painful!
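
Related to both the low 4k rados bench numbers and the small-object issue: for reference, this is roughly how I've been benchmarking. It's only a sketch; the pool name, pg count and thread count are placeholders rather than a recommendation:

    # throwaway pool just for benchmarking (pg count is a guess for a cluster this size)
    ceph osd pool create bench-test 256 256

    # 60 seconds of 4KB writes, 16 in flight, keeping the objects so they can be read back
    rados -p bench-test bench 60 write -b 4096 -t 16 --no-cleanup

    # 60 seconds of random 4KB reads against the objects written above
    rados -p bench-test bench 60 rand -t 16

    # remove the benchmark objects and drop the pool afterwards
    rados -p bench-test cleanup
    ceph osd pool delete bench-test bench-test --yes-i-really-really-mean-it

Watching atop/iostat on the OSD nodes while that runs is next on my list.
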
> > On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer <ch...@gol.com> wrote:
> >
> > > Hello,
> > >
> > > On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
> > >
> > > > Replies in-line:
> > > >
> > > > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer <c-bal...@fusioncom.co.jp> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I've been trying to nail down a nasty performance issue related to scrubbing. I am mostly using radosgw with a handful of buckets containing millions of various sized objects. When ceph scrubs, both regular and deep, radosgw blocks on external requests, and my cluster has a bunch of requests that have blocked for > 32 seconds. Frequently OSDs are marked down.
> > > > >
> > > > > From my own (painful) experiences, let me state this:
> > > > >
> > > > > 1. When your cluster runs out of steam during deep-scrubs, drop what you're doing and order more HW (OSDs), because this is a sign that it would also be in trouble when doing recoveries.
> > > >
> > > > When I've initiated recoveries from working on the hardware, the cluster hasn't had a problem keeping up. It seems that it only has a problem with scrubbing, meaning it feels like the IO pattern is drastically different. I would think that with scrubbing I'd see something closer to bursty sequential reads, rather than just thrashing the drives with a more random IO pattern, especially given our low cluster utilization.
> > >
> > > It's probably more pronounced when phasing in/out entire OSDs, where it also has to read the entire (primary) data off them.
> > >
> > > > > 2. If your cluster is inconvenienced by even mere scrubs, you're really in trouble. Threaten the penny pincher with bodily violence and have that new HW phased in yesterday.
> > > >
> > > > I am the penny pincher, biz owner, dev and ops guy for http://ridewithgps.com :) More hardware isn't an issue, it just feels pretty crazy to have performance this low on a 12 OSD system. Granted, that feeling isn't backed by anything concrete! In general, I like to understand the problem before I solve it with hardware, though I am definitely not averse to it. I already ordered 6 more 4tb drives along with the new journal SSDs, anticipating the need.
> > > >
> > > > As you can see from the output of ceph status, we are not space hungry by any means.
> > >
> > > Well, in Ceph having just one OSD pegged to max will (eventually) impact everything that needs to read/write primary PGs on it.
> > >
> > > More below.
> > >
> > > > > > According to atop, the OSDs being deep scrubbed are reading at only 5mb/s to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20 minutes.
> > > > > >
> > > > > > Here's a screenshot of atop from a node: https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
> > > > >
> > > > > This looks familiar.
> > > > > Basically at this point in time the competing read requests for all the objects clash with write requests and completely saturate your HDs (about 120 IOPS and 85% busy according to your atop screenshot).
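
That saturation matches what I see here. For anyone following along, this is more or less how I've been watching it while a deep-scrub runs; nothing clever, just a sketch:

    # per-device IOPS, throughput and %util every 5 seconds (needs the sysstat package)
    iostat -dmx 5

    # which PGs are scrubbing right now, and on which OSDs
    ceph pg dump pgs_brief | grep -i scrub

    # slow/blocked requests as they pile up during a scrub
    ceph health detail
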
> > > > In your experience, would the scrub operation benefit from a bigger readahead? Meaning, is it more sequential than random reads? I already bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.
> > >
> > > I played with that a long time ago (in benchmark scenarios) and didn't see any noticeable improvement.
> > > Deep-scrub might benefit (fragmentation could hurt it though), regular scrub not so much.
> > >
> > > > About half of our reads are on objects with an average size of 40kb (map thumbnails), and the other half are on photo thumbs with a size between 10kb and 150kb.
> > >
> > > Noted, see below.
> > >
> > > > After doing a little more research, I came across this:
> > > >
> > > > http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage
> > > >
> > > > Sounds like I am probably running into issues with lots of random read IO, combined with known issues around small files. To give an idea, I have about 15 million small map thumbnails stored in my two largest buckets, and I am pushing out about 30 requests per second right now from those two buckets.
> > >
> > > This is certainly a factor, but that knowledge of a future improvement won't help you with your current problem, of course. ^_-
> > >
> > > > > There are ceph configuration options that can mitigate this to some extent and which I don't see in your config, like "osd_scrub_load_threshold" and "osd_scrub_sleep", along with the various IO priority settings.
> > > > > However, the points above still stand.
> > > >
> > > > Yes, I have a running list of config options to try out; I just wanted to touch base with other community members before shooting in the dark.
> > >
> > > osd_scrub_sleep is probably the most effective immediately available option for you to prevent slow, stalled IO, at the obvious cost of scrubs taking even longer.
> > > There is of course also the option to disable scrubs entirely until your HW has been upgraded.
> > >
> > > > > XFS defragmentation might help, significantly so if your FS is badly fragmented. But again, this is only a temporary band-aid.
> > > > >
> > > > > > First question: is this a reasonable speed for scrubbing, given a very lightly used cluster? Here are some cluster details:
> > > > > >
> > > > > > deploy@drexler:~$ ceph --version
> > > > > > ceph version 0.94.1-5-g85a68f9 (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)
> > > > > >
> > > > > > 2x Xeon E5-2630 per node, 64gb of ram per node.
> > > > >
> > > > > More memory can help by keeping hot objects in the page cache (so the actual disks need not be read and can write at their full IOPS capacity). A lot of memory (and the correct sysctl settings) will also allow for a large SLAB space, keeping all those directory entries and other bits in memory without having to go to disk to get them.
> > > > >
> > > > > You seem to be just fine CPU-wise.
> > > >
> > > > I thought about bumping each node up to 128gb of ram as another cheap insurance policy. I'll try that after the other changes. I'd like to understand the why, so I'll try to change one thing at a time, though I am also just eager to have this thing stable.
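
For reference, the scrub-related knobs currently in my notes look roughly like the snippet below. The values are starting guesses pulled from list discussions, not something I've validated on this cluster yet, and as I understand it the ioprio settings only do anything when the disks use the CFQ scheduler:

    [osd]
    # never more than one scrub per OSD (the default, kept explicit)
    osd max scrubs = 1
    # sleep between scrub chunks to give client IO some breathing room
    osd scrub sleep = 0.1
    # don't start new scrubs when the node is already loaded
    osd scrub load threshold = 2.5
    # run the disk/scrub thread at idle IO priority (CFQ only)
    osd disk thread ioprio class = idle
    osd disk thread ioprio priority = 7

And "ceph osd set noscrub" / "ceph osd set nodeep-scrub" remain the big hammer if things get bad before the new hardware lands.
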
> > > For me everything was sweet and dandy as long as all the really hot objects fit in the page cache and the FS bits were all in SLAB (no need to go to disk for an "ls -R").
> > >
> > > Past that point it all went to molasses land "quickly".
> > >
> > > > > > deploy@drexler:~$ ceph status
> > > > > >     cluster 234c6825-0e2b-4256-a710-71d29f4f023e
> > > > > >      health HEALTH_WARN
> > > > > >             118 requests are blocked > 32 sec
> > > > > >      monmap e1: 3 mons at {drexler=10.0.0.36:6789/0,lucy=10.0.0.38:6789/0,paley=10.0.0.34:6789/0}
> > > > > >             election epoch 296, quorum 0,1,2 paley,drexler,lucy
> > > > > >      mdsmap e19989: 1/1/1 up {0=lucy=up:active}, 1 up:standby
> > > > > >      osdmap e1115: 12 osds: 12 up, 12 in
> > > > > >       pgmap v21748062: 1424 pgs, 17 pools, 3185 GB data, 20493 kobjects
> > > > > >             10060 GB used, 34629 GB / 44690 GB avail
> > > > > >                 1422 active+clean
> > > > > >                    1 active+clean+scrubbing+deep
> > > > > >                    1 active+clean+scrubbing
> > > > > >   client io 721 kB/s rd, 33398 B/s wr, 53 op/s
> > > > >
> > > > > You want to avoid having scrubs going on willy-nilly in parallel and at high peak times, even IF your cluster is capable of handling them.
> > > > >
> > > > > Depending on how busy your cluster is and its usage pattern, you may do what I did: kick off a deep scrub of all OSDs ("ceph osd deep-scrub \*") at, say, 01:00 on a Saturday morning.
> > > > > If your cluster is fast enough, it will finish before 07:00 (without killing your client performance) and all regular scrubs will now happen in that time frame as well (given default settings). If your cluster isn't fast enough, see my initial 2 points. ^o^
> > > >
> > > > The problem is that our cluster is the image and upload store for our site, which is a reasonably busy international site. We have about 60% of our customers in North America, and 30% or so in Europe and Asia. We definitely would be better off with more scrubs between 11pm and 7am, GMT-8 through GMT, though we can't afford to slam the cluster.
> > > >
> > > > I suppose that our cluster is a much more random mix of reads than many others using ceph as an RBD store. Operating systems probably have a stronger mix of sequential reads, whereas our users are concurrently viewing different pages with different images, a more random workload.
> > > >
> > > > It sounds like we have to keep cluster storage utilization under 25% in order to have reasonable performance. I guess this makes sense; we have much more random IO need than storage need.
> > >
> > > In your use case (and most others) random IOPS tends to be the bottleneck long, long before either space or sequential bandwidth becomes an issue.
> > >
> > > More spindles, more IOPS. See below. ^o^
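
On the scheduling side, Christian's quiet-window idea above would probably look something like the following for us. These are illustrative /etc/cron.d entries only; the times are server-local and I haven't settled on the exact window yet. Also, as far as I can tell the noscrub flags only stop new scrubs from starting, they don't interrupt ones already running:

    # kick off a deep scrub of everything early Saturday morning
    0 1 * * 6   root  ceph osd deep-scrub \*

    # block new scrubs during the busy window, allow them again at night
    0 7 * * *   root  ceph osd set noscrub; ceph osd set nodeep-scrub
    0 23 * * *  root  ceph osd unset noscrub; ceph osd unset nodeep-scrub
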
> > > > > > deploy@drexler:~$ ceph osd tree
> > > > > > ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > > > > > -1 43.67999 root default
> > > > > > -2 14.56000     host paley
> > > > > >  0  3.64000         osd.0        up  1.00000          1.00000
> > > > > >  3  3.64000         osd.3        up  1.00000          1.00000
> > > > > >  6  3.64000         osd.6        up  1.00000          1.00000
> > > > > >  9  3.64000         osd.9        up  1.00000          1.00000
> > > > > > -3 14.56000     host lucy
> > > > > >  1  3.64000         osd.1        up  1.00000          1.00000
> > > > > >  4  3.64000         osd.4        up  1.00000          1.00000
> > > > > >  7  3.64000         osd.7        up  1.00000          1.00000
> > > > > > 11  3.64000         osd.11       up  1.00000          1.00000
> > > > > > -4 14.56000     host drexler
> > > > > >  2  3.64000         osd.2        up  1.00000          1.00000
> > > > > >  5  3.64000         osd.5        up  1.00000          1.00000
> > > > > >  8  3.64000         osd.8        up  1.00000          1.00000
> > > > > > 10  3.64000         osd.10       up  1.00000          1.00000
> > > > > >
> > > > > > My OSDs are 4tb 7200rpm Hitachi DeskStars, using XFS, with Samsung 850 Pro journals (very slow, ordered s3700 replacements, but shouldn't pose problems for reading as far as I understand things).
> > > > >
> > > > > Just to make sure, these are genuine DeskStars?
> > > > > I'm asking both because AFAIK they are out of production and because their direct successors, the Toshiba DT drives, (can) have a nasty firmware bug that totally ruins their performance (from ~8 hours per week to permanently, until power-cycled).
> > > >
> > > > These are original DeskStars. Didn't realize they weren't in production; I just grabbed 6 more of the Hitachi DeskStar NAS edition 4tb drives, which are readily available. I probably should have ordered 6tb drives, as I'd end up with better seek times due to them not being fully utilized - the data would reside closer to the center of the platters.
> > >
> > > Ah, DeskStar NAS, yes, these are still in production.
> > >
> > > I'd get more, smaller, faster HDDs instead.
> > > HW cache on your controller can also help (it depends on the model/FW whether it is used efficiently in JBOD mode).
> > >
> > > And since your space utilization is small (though of course that can and will change over time), you may very well benefit from going SSD.
> > >
> > > SSD pools if you think you can fit (economically) a set of your high-access data, like the thumbnails, on them.
> > >
> > > SSD cache tiers are a bit more dubious when it comes to rewards, but that depends a lot on the hot data set. Plenty of discussion in here about that.
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > > > > Regards,
> > > > >
> > > > > Christian
> > > > >
> > > > > > MONs are co-located with OSD nodes, but the nodes are fairly beefy and have very low load. Drives are on an expander backplane, with an LSI SAS3008 controller.
> > > > > >
> > > > > > I have a fairly standard config as well:
> > > > > >
> > > > > > https://gist.github.com/kingcu/aae7373eb62ceb7579da
> > > > > >
> > > > > > I know that I don't have a ton of OSDs, but I'd expect a little better performance than this.
> > > > > > Check out the munin graphs for my three nodes:
> > > > > >
> > > > > > http://munin.ridewithgps.com/ridewithgps.com/drexler.ridewithgps.com/index.html#disk
> > > > > > http://munin.ridewithgps.com/ridewithgps.com/paley.ridewithgps.com/index.html#disk
> > > > > > http://munin.ridewithgps.com/ridewithgps.com/lucy.ridewithgps.com/index.html#disk
> > > > > >
> > > > > > Any input would be appreciated before I start trying to micro-optimize config params, as well as upgrading to Infernalis.
> > > > > >
> > > > > > Cheers,
> > > > > > Cullen
> > > > >
> > > > > --
> > > > > Christian Balzer        Network/Systems Engineer
> > > > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> > > > > http://www.gol.com/
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
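
Coming back to the SSD pool idea from earlier in the thread: the rough shape of what I have in mind for the thumbnail data is below. This is only a sketch; the rule/pool names, pg counts and ruleset id are placeholders, it assumes the SSD OSDs sit under their own CRUSH root called "ssd", and the radosgw side (placement target in the region/zone config, then pointing the thumbnail buckets at it) is left out because I still need to read up on that part:

    # simple replicated rule that only chooses OSDs under the (assumed) "ssd" root
    ceph osd crush rule create-simple ssd_rule ssd host

    # replicated pool for rgw bucket data on those SSDs (pg count is a guess)
    ceph osd pool create .rgw.buckets.ssd 128 128 replicated ssd_rule

    # on hammer an existing pool can also be moved by ruleset id (id 1 is just an example)
    ceph osd pool set .rgw.buckets.ssd crush_ruleset 1
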
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com