On Wed, Feb 17, 2016 at 12:13 AM, Christian Balzer <ch...@gol.com> wrote:
> Hello,
>
> On Tue, 16 Feb 2016 10:46:32 -0800 Cullen King wrote:
>
> > Thanks for the helpful commentary, Christian. The cluster is performing much better with 50% more spindles (12 to 18 drives), along with setting scrub sleep to 0.1. Didn't see really any gain from moving from the Samsung 850 Pro journal drives to the Intel 3710s, even though dd and other direct tests of the drives yielded much better results. rados bench with 4k requests is still awfully low. I'll figure that problem out next.
>
> Got examples, numbers, watched things with atop?
> 4KB rados benches are what can make my CPUs melt on the cluster here that's most similar to yours. ^o^
>
> > I ended up bumping the number of placement groups from 512 to 1024, which should help a little bit. Basically it'll change the worst-case scrub performance so that it is distributed a little more evenly across drives rather than clustered on a single drive for longer.
>
> Of course with osd_max_scrubs at its default of 1 there should never be more than one scrub per OSD.
> However, I seem to vaguely remember that this is per "primary" scrub, so in the case of deep-scrubs there could still be plenty of contention going on.
> Again, I've always had good success with that manually kicked-off scrub of all OSDs. It seems to sequence things nicely and finishes within 4 hours on my "good" production cluster.
>
> > I think the real solution here is to create a secondary SSD pool, pin some radosgw buckets to it and put my thumbnail data on the smaller, faster pool. I'll reserve the spindle-based pool for the original high-res photos, which are only read to create thumbnails when necessary. This should put the majority of my random read IO on SSDs, and thumbnails average 50kb each so it shouldn't be too spendy. I am considering trying the newer Samsung SM863 drives since we are read heavy, and any potential data loss on this thumbnail pool would not be catastrophic.
>
> I seriously detest it when makers don't put their endurance data on the web page with all the other specifications and make you look things up in a slightly hidden PDF.
> Then they give you the total endurance and make you calculate drive writes per day yourself. ^o^
> Only to find that these have 3 DWPD, which is nothing to be ashamed of and should be fine for this particular use case.
>
> However, take a look at this old posting of mine:
>
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>
> With that in mind, I'd recommend you do some testing with real-world data before you invest too much into something that will wear out long before it has paid for itself.

We are not write heavy at all; if my current drives are any indication, I'd only do one drive write per year on the things.

> Christian
>
> > Third, it seems that I am also running into the known "Lots Of Small Files" performance issue. Looks like performance in my use case will be drastically improved with the upcoming bluestore, though migrating to it sounds painful!
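
Related to both the low 4k rados bench numbers and the small-object issue: for reference, this is roughly how I've been benchmarking. It's only a sketch; the pool name, pg count and thread count are placeholders rather than a recommendation:

    # throwaway pool just for benchmarking (pg count is a guess for a cluster this size)
    ceph osd pool create bench-test 256 256

    # 60 seconds of 4KB writes, 16 in flight, keeping the objects so they can be read back
    rados -p bench-test bench 60 write -b 4096 -t 16 --no-cleanup

    # 60 seconds of random 4KB reads against the objects written above
    rados -p bench-test bench 60 rand -t 16

    # remove the benchmark objects and drop the pool afterwards
    rados -p bench-test cleanup
    ceph osd pool delete bench-test bench-test --yes-i-really-really-mean-it

Watching atop/iostat on the OSD nodes while that runs is next on my list.
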
> > On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer <ch...@gol.com> wrote:
> >
> > > Hello,
> > >
> > > On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
> > >
> > > > Replies in-line:
> > > >
> > > > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer <c-bal...@fusioncom.co.jp> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I've been trying to nail down a nasty performance issue related to scrubbing. I am mostly using radosgw with a handful of buckets containing millions of various sized objects. When ceph scrubs, both regular and deep, radosgw blocks on external requests, and my cluster has a bunch of requests that have blocked for > 32 seconds. Frequently OSDs are marked down.
> > > > >
> > > > > From my own (painful) experiences, let me state this:
> > > > >
> > > > > 1. When your cluster runs out of steam during deep-scrubs, drop what you're doing and order more HW (OSDs), because this is a sign that it would also be in trouble when doing recoveries.
> > > >
> > > > When I've initiated recoveries from working on the hardware, the cluster hasn't had a problem keeping up. It seems that it only has a problem with scrubbing, meaning it feels like the IO pattern is drastically different. I would think that with scrubbing I'd see something closer to bursty sequential reads, rather than just thrashing the drives with a more random IO pattern, especially given our low cluster utilization.
> > >
> > > It's probably more pronounced when phasing in/out entire OSDs, where it also has to read the entire (primary) data off them.
> > >
> > > > > 2. If your cluster is inconvenienced by even mere scrubs, you're really in trouble. Threaten the penny pincher with bodily violence and have that new HW phased in yesterday.
> > > >
> > > > I am the penny pincher, biz owner, dev and ops guy for http://ridewithgps.com :) More hardware isn't an issue, it just feels pretty crazy to have performance this low on a 12 OSD system. Granted, that feeling isn't backed by anything concrete! In general, I like to understand the problem before I solve it with hardware, though I am definitely not averse to it. I already ordered 6 more 4tb drives along with the new journal SSDs, anticipating the need.
> > > >
> > > > As you can see from the output of ceph status, we are not space hungry by any means.
> > >
> > > Well, in Ceph having just one OSD pegged to max will (eventually) impact everything that needs to read/write primary PGs on it.
> > >
> > > More below.
> > >
> > > > > > According to atop, the OSDs being deep scrubbed are reading at only 5mb/s to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20 minutes.
> > > > > >
> > > > > > Here's a screenshot of atop from a node: https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
> > > > >
> > > > > This looks familiar.
> > > > > Basically at this point in time the competing read requests for all the objects clash with write requests and completely saturate your HDs (about 120 IOPS and 85% busy according to your atop screenshot).
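
That saturation matches what I see here. For anyone following along, this is more or less how I've been watching it while a deep-scrub runs; nothing clever, just a sketch:

    # per-device IOPS, throughput and %util every 5 seconds (needs the sysstat package)
    iostat -dmx 5

    # which PGs are scrubbing right now, and on which OSDs
    ceph pg dump pgs_brief | grep -i scrub

    # slow/blocked requests as they pile up during a scrub
    ceph health detail
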
> > > > In your experience, would the scrub operation benefit from a bigger readahead? Meaning, is it more sequential than random reads? I already bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.
> > >
> > > I played with that a long time ago (in benchmark scenarios) and didn't see any noticeable improvement.
> > > Deep-scrub might benefit (fragmentation could hurt it though), regular scrub not so much.
> > >
> > > > About half of our reads are on objects with an average size of 40kb (map thumbnails), and the other half are on photo thumbs with a size between 10kb and 150kb.
> > >
> > > Noted, see below.
> > >
> > > > After doing a little more research, I came across this:
> > > >
> > > > http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage
> > > >
> > > > Sounds like I am probably running into issues with lots of random read IO, combined with known issues around small files. To give an idea, I have about 15 million small map thumbnails stored in my two largest buckets, and I am pushing out about 30 requests per second right now from those two buckets.
> > >
> > > This is certainly a factor, but that knowledge of a future improvement won't help you with your current problem, of course. ^_-
> > >
> > > > > There are ceph configuration options that can mitigate this to some extent and which I don't see in your config, like "osd_scrub_load_threshold" and "osd_scrub_sleep", along with the various IO priority settings.
> > > > > However, the points above still stand.
> > > >
> > > > Yes, I have a running list of config options to try out; I just wanted to touch base with other community members before shooting in the dark.
> > >
> > > osd_scrub_sleep is probably the most effective immediately available option for you to prevent slow, stalled IO, at the obvious cost of scrubs taking even longer.
> > > There is of course also the option to disable scrubs entirely until your HW has been upgraded.
> > >
> > > > > XFS defragmentation might help, significantly so if your FS is badly fragmented. But again, this is only a temporary band-aid.
> > > > >
> > > > > > First question: is this a reasonable speed for scrubbing, given a very lightly used cluster? Here are some cluster details:
> > > > > >
> > > > > > deploy@drexler:~$ ceph --version
> > > > > > ceph version 0.94.1-5-g85a68f9 (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)
> > > > > >
> > > > > > 2x Xeon E5-2630 per node, 64gb of ram per node.
> > > > >
> > > > > More memory can help by keeping hot objects in the page cache (so the actual disks need not be read and can write at their full IOPS capacity). A lot of memory (and the correct sysctl settings) will also allow for a large SLAB space, keeping all those directory entries and other bits in memory without having to go to disk to get them.
> > > > >
> > > > > You seem to be just fine CPU-wise.
> > > >
> > > > I thought about bumping each node up to 128gb of ram as another cheap insurance policy. I'll try that after the other changes. I'd like to understand the why, so I'll try to change one thing at a time, though I am also just eager to have this thing stable.
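
For reference, the scrub-related knobs currently in my notes look roughly like the snippet below. The values are starting guesses pulled from list discussions, not something I've validated on this cluster yet, and as I understand it the ioprio settings only do anything when the disks use the CFQ scheduler:

    [osd]
    # never more than one scrub per OSD (the default, kept explicit)
    osd max scrubs = 1
    # sleep between scrub chunks to give client IO some breathing room
    osd scrub sleep = 0.1
    # don't start new scrubs when the node is already loaded
    osd scrub load threshold = 2.5
    # run the disk/scrub thread at idle IO priority (CFQ only)
    osd disk thread ioprio class = idle
    osd disk thread ioprio priority = 7

And "ceph osd set noscrub" / "ceph osd set nodeep-scrub" remain the big hammer if things get bad before the new hardware lands.
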
> > > For me everything was sweet and dandy as long as all the really hot objects fit in the page cache and the FS bits were all in SLAB (no need to go to disk for an "ls -R").
> > >
> > > Past that point it all went to molasses land "quickly".
> > >
> > > > > > deploy@drexler:~$ ceph status
> > > > > >     cluster 234c6825-0e2b-4256-a710-71d29f4f023e
> > > > > >      health HEALTH_WARN
> > > > > >             118 requests are blocked > 32 sec
> > > > > >      monmap e1: 3 mons at {drexler=10.0.0.36:6789/0,lucy=10.0.0.38:6789/0,paley=10.0.0.34:6789/0}
> > > > > >             election epoch 296, quorum 0,1,2 paley,drexler,lucy
> > > > > >      mdsmap e19989: 1/1/1 up {0=lucy=up:active}, 1 up:standby
> > > > > >      osdmap e1115: 12 osds: 12 up, 12 in
> > > > > >       pgmap v21748062: 1424 pgs, 17 pools, 3185 GB data, 20493 kobjects
> > > > > >             10060 GB used, 34629 GB / 44690 GB avail
> > > > > >                 1422 active+clean
> > > > > >                    1 active+clean+scrubbing+deep
> > > > > >                    1 active+clean+scrubbing
> > > > > >   client io 721 kB/s rd, 33398 B/s wr, 53 op/s
> > > > >
> > > > > You want to avoid having scrubs going on willy-nilly in parallel and at high peak times, even IF your cluster is capable of handling them.
> > > > >
> > > > > Depending on how busy your cluster is and its usage pattern, you may do what I did: kick off a deep scrub of all OSDs ("ceph osd deep-scrub \*") at, say, 01:00 on a Saturday morning.
> > > > > If your cluster is fast enough, it will finish before 07:00 (without killing your client performance) and all regular scrubs will now happen in that time frame as well (given default settings). If your cluster isn't fast enough, see my initial 2 points. ^o^
> > > >
> > > > The problem is that our cluster is the image and upload store for our site, which is a reasonably busy international site. We have about 60% of our customers in North America, and 30% or so in Europe and Asia. We definitely would be better off with more scrubs between 11pm and 7am, GMT-8 through GMT, though we can't afford to slam the cluster.
> > > >
> > > > I suppose that our cluster is a much more random mix of reads than many others using ceph as an RBD store. Operating systems probably have a stronger mix of sequential reads, whereas our users are concurrently viewing different pages with different images, a more random workload.
> > > >
> > > > It sounds like we have to keep cluster storage utilization under 25% in order to have reasonable performance. I guess this makes sense; we have much more random IO need than storage need.
> > >
> > > In your use case (and most others) random IOPS tends to be the bottleneck long, long before either space or sequential bandwidth becomes an issue.
> > >
> > > More spindles, more IOPS. See below. ^o^
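
On the scheduling side, Christian's quiet-window idea above would probably look something like the following for us. These are illustrative /etc/cron.d entries only; the times are server-local and I haven't settled on the exact window yet. Also, as far as I can tell the noscrub flags only stop new scrubs from starting, they don't interrupt ones already running:

    # kick off a deep scrub of everything early Saturday morning
    0 1 * * 6   root  ceph osd deep-scrub \*

    # block new scrubs during the busy window, allow them again at night
    0 7 * * *   root  ceph osd set noscrub; ceph osd set nodeep-scrub
    0 23 * * *  root  ceph osd unset noscrub; ceph osd unset nodeep-scrub
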
> > > > > > deploy@drexler:~$ ceph osd tree
> > > > > > ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > > > > > -1 43.67999 root default
> > > > > > -2 14.56000     host paley
> > > > > >  0  3.64000         osd.0        up  1.00000          1.00000
> > > > > >  3  3.64000         osd.3        up  1.00000          1.00000
> > > > > >  6  3.64000         osd.6        up  1.00000          1.00000
> > > > > >  9  3.64000         osd.9        up  1.00000          1.00000
> > > > > > -3 14.56000     host lucy
> > > > > >  1  3.64000         osd.1        up  1.00000          1.00000
> > > > > >  4  3.64000         osd.4        up  1.00000          1.00000
> > > > > >  7  3.64000         osd.7        up  1.00000          1.00000
> > > > > > 11  3.64000         osd.11       up  1.00000          1.00000
> > > > > > -4 14.56000     host drexler
> > > > > >  2  3.64000         osd.2        up  1.00000          1.00000
> > > > > >  5  3.64000         osd.5        up  1.00000          1.00000
> > > > > >  8  3.64000         osd.8        up  1.00000          1.00000
> > > > > > 10  3.64000         osd.10       up  1.00000          1.00000
> > > > > >
> > > > > > My OSDs are 4tb 7200rpm Hitachi DeskStars, using XFS, with Samsung 850 Pro journals (very slow, ordered s3700 replacements, but shouldn't pose problems for reading as far as I understand things).
> > > > >
> > > > > Just to make sure, these are genuine DeskStars?
> > > > > I'm asking both because AFAIK they are out of production and because their direct successors, the Toshiba DT drives, (can) have a nasty firmware bug that totally ruins their performance (from ~8 hours per week to permanently, until power-cycled).
> > > >
> > > > These are original DeskStars. Didn't realize they weren't in production; I just grabbed 6 more of the Hitachi DeskStar NAS edition 4tb drives, which are readily available. I probably should have ordered 6tb drives, as I'd end up with better seek times due to them not being fully utilized - the data would reside closer to the center of the platters.
> > >
> > > Ah, DeskStar NAS, yes, these are still in production.
> > >
> > > I'd get more, smaller, faster HDDs instead.
> > > HW cache on your controller can also help (it depends on the model/FW whether it is used efficiently in JBOD mode).
> > >
> > > And since your space utilization is small (though of course that can and will change over time), you may very well benefit from going SSD.
> > >
> > > SSD pools if you think you can fit (economically) a set of your high-access data, like the thumbnails, on them.
> > >
> > > SSD cache tiers are a bit more dubious when it comes to rewards, but that depends a lot on the hot data set. Plenty of discussion in here about that.
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > > > > Regards,
> > > > >
> > > > > Christian
> > > > >
> > > > > > MONs are co-located with OSD nodes, but the nodes are fairly beefy and have very low load. Drives are on an expander backplane, with an LSI SAS3008 controller.
> > > > > >
> > > > > > I have a fairly standard config as well:
> > > > > >
> > > > > > https://gist.github.com/kingcu/aae7373eb62ceb7579da
> > > > > >
> > > > > > I know that I don't have a ton of OSDs, but I'd expect a little better performance than this.
> > > > > > Check out the munin graphs for my three nodes:
> > > > > >
> > > > > > http://munin.ridewithgps.com/ridewithgps.com/drexler.ridewithgps.com/index.html#disk
> > > > > > http://munin.ridewithgps.com/ridewithgps.com/paley.ridewithgps.com/index.html#disk
> > > > > > http://munin.ridewithgps.com/ridewithgps.com/lucy.ridewithgps.com/index.html#disk
> > > > > >
> > > > > > Any input would be appreciated before I start trying to micro-optimize config params, as well as upgrading to Infernalis.
> > > > > >
> > > > > > Cheers,
> > > > > > Cullen
> > > > >
> > > > > --
> > > > > Christian Balzer        Network/Systems Engineer
> > > > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> > > > > http://www.gol.com/
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
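
Coming back to the SSD pool idea from earlier in the thread: the rough shape of what I have in mind for the thumbnail data is below. This is only a sketch; the rule/pool names, pg counts and ruleset id are placeholders, it assumes the SSD OSDs sit under their own CRUSH root called "ssd", and the radosgw side (placement target in the region/zone config, then pointing the thumbnail buckets at it) is left out because I still need to read up on that part:

    # simple replicated rule that only chooses OSDs under the (assumed) "ssd" root
    ceph osd crush rule create-simple ssd_rule ssd host

    # replicated pool for rgw bucket data on those SSDs (pg count is a guess)
    ceph osd pool create .rgw.buckets.ssd 128 128 replicated ssd_rule

    # on hammer an existing pool can also be moved by ruleset id (id 1 is just an example)
    ceph osd pool set .rgw.buckets.ssd crush_ruleset 1
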
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com