Hi Oliver, 

Have you tried tuning some of the cluster settings to fix the IO errors
in the VMs? 

We ran into some of the same issues when reweighting, backfilling and
removing large snapshots. By minimizing the number of concurrent
backfills and prioritizing client IO, we can now add/remove OSDs without
the VMs throwing those nasty IO errors.

We have been running a 3-node cluster for about a year now on Hammer,
with 45x 2 TB SATA OSDs and no SSDs. It's backing KVM hosts with RBD
images.

Here are the things we changed: 

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1' 
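
Note that injectargs only changes the running daemons; if you want the
values to survive OSD restarts, the same settings can also go into the
[osd] section of ceph.conf. A sketch with the same values as above:

[osd]
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
osd client op priority = 63
osd recovery max active = 1
osd snap trim sleep = 0.1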

Recovery may take a little longer while backfilling, but the cluster is
still responsive and we have happy VMs now. 

I've collected these from various posts on the ceph-users list.

Maybe they will help you if you haven't tried them already. 

Chris 

On 2016-04-07 4:18 am, Oliver Dzombic wrote: 

> Hi Christian,
> 
> Thank you for answering, I appreciate your time!
> 
> ---
> 
> It's used for RBD-hosted VMs and also CephFS-hosted VMs.
> 
> Well, the basic problem is/was that single OSDs simply go out/down,
> ending in SATA bus errors for the VMs, which then have to be rebooted,
> if they can be at all, because as long as OSDs are missing in that
> scenario, the customers can't start their VMs.
> 
> Installing/checking Munin revealed a very high drive utilization, and
> with that simply an overload of the cluster.
> 
> The initial setup was 4 nodes with 4x mon, each node holding 3x 6 TB
> HDD and 1x SSD for journal.
> 
> So I started to add more OSDs (2 nodes, with 3x 6 TB HDD and 1x SSD
> for journal) and, as first aid, reduced the replication from 3 to 2
> to lower the (write) load of the cluster.
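> 
> For reference, such a size change boils down to something like this,
> with "rbd" only as an example pool name (min_size may need adjusting
> to match):
> 
> ceph osd pool set rbd size 2
> ceph osd pool set rbd min_size 1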
> 
> I planned to wait until the new LTS is out, but I have now already
> added another node with 10x 3 TB HDD, 2x SSD for journal and 2-3x SSD
> for cache tier (changing strategy and increasing the number of drives
> while reducing their size; the original sizing was a design mistake on
> my part).
> 
> osdmap e31602: 28 osds: 28 up, 28 in
> flags noscrub,nodeep-scrub
> pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
> 39270 GB used, 88290 GB / 124 TB avail
> 1428 active+clean
> 
> The load ranges from 200 op/s to around 5000 op/s.
> 
> The current average drive utilization is 20-30%.
> 
> If we have backfill (OSD out/down) or a reweight, the utilization of
> the HDD drives goes straight to 90-100%.
> 
> Munin shows an average disk latency of 170 ms on all drives (except
> the SSDs), with a minimum of 80-130 ms and a maximum of 300-600 ms.
> 
> Currently, the 4 initial nodes are in datacenter A and the 3 other
> nodes are, together with most of the VMs, in datacenter B.
> 
> I am currently draining the 4 initial nodes by doing
> 
> ceph osd reweight
> 
> little by little to reduce their usage, so that I can remove the OSDs
> completely from there and just keep the monitors running.
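> 
> A sketch of how that looks per OSD (osd.3 and the step values are just
> examples):
> 
> ceph osd reweight 3 0.8    # let backfill settle before the next step
> ceph osd reweight 3 0.5
> ceph osd reweight 3 0.2
> ceph osd out 3             # finally take the emptied OSD out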
> 
> The complete cluster has to move to a single datacenter together with
> all the VMs.
> 
> ---
> 
> I am reducing the number of nodes because, from an administrative
> point of view, it's not very handy. I prefer to extend the hardware
> power in terms of CPU, RAM and HDDs per node.
> 
> So the endcluster will look like:
> 
> 3x OSD Nodes, each:
> 
> 2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit network, Adaptec HBA 1000-16e
> to connect to external JBOD servers holding the cold storage HDDs.
> Maybe ~24 drives of 2 or 3 TB, SAS or SATA, 7200 RPM.
> 
> I think SAS is, because of the reduced access times (4-5 ms vs. 10 ms),
> very useful in a Ceph environment. But then again, maybe with a cache
> tier the impact/difference is not really that big.
> 
> That together with Samsung SM863 240 GB SSDs for journal and cache
> tier, connected to the board directly or to a separate Adaptec HBA
> 1000-16i.
> 
> So far the current idea/theory/plan.
> 
> ---
> 
> But it's a long road to that point. Last night I was reweighting 3
> OSDs from 1.0 to 0.9, which ended with one HDD going down/out, so I had
> to restart the OSD (again with IO errors in some of the VMs).
> 
> So based on your article, the cache tier solved your problem, and I
> think I have basically the same one.
> 
> ---
> 
> So a very good hint is to activate the whole cache tier at night, when
> things are a bit smoother.
> 
> Any suggestions / criticism / advice is highly welcome :-)
> 
> Thank you!
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Address:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402, Hanau Local Court
> Managing Director: Oliver Dzombic
> 
> Tax No.: 35 236 3622 1
> VAT ID: DE274086107
> 
> On 07.04.2016 at 05:32, Christian Balzer wrote:
> > Hello,
> > 
> > On Wed, 6 Apr 2016 20:35:20 +0200 Oliver Dzombic wrote:
> > 
> > > Hi,
> > > 
> > > I have some IO issues, and after Christian's great article/hint
> > > about caches I plan to add caches too.
> > 
> > Thanks, version 2 is still a work in progress, as I keep running into
> > unknowns.
> > 
> > IO issues in what sense, like in too many write IOPS for the current
> > HW to sustain?
> > Also, what are you using Ceph for, RBD hosting VM images?
> > 
> > It will help you a lot if you can identify and quantify the usage
> > patterns (including a rough idea on how many hot objects you have)
> > and where you run into limits.
> > 
> > > So now comes the troublesome question:
> > > 
> > > How dangerous is it to add cache tiers to an existing cluster with
> > > around 30 OSDs and 40 TB of data on 3-6 (currently being reduced)
> > > nodes?
> > 
> > You're reducing nodes? Why?
> > More nodes/OSDs equates to more IOPS in general.
> > 
> > 40TB is a sizable amount of data, how many objects does your cluster
> > hold? Also, is that raw data or after replication (size 3?)?
> > In short, "ceph -s" output please. ^.^
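> > 
> > (If you want the per-pool breakdown as well, "ceph df detail" or
> > "rados df" will show object counts per pool.)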
> > 
> > > I mean, will everything just explode and I just die, or what is the
> > > road map to introduce this once you already have a running cluster?
> > 
> > That's pretty straightforward from the Ceph docs at:
> > http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> > (replace master with hammer if you're running that)
> > 
> > Nothing happens until the "set-overlay" bit and you will want to
> > configure all the pertinent bits before that.
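> > 
> > Roughly, the sequence from those docs looks like this, with
> > "set-overlay" last as mentioned; the pool names ("rbd" as the backing
> > pool, "cache" as the cache pool) and the sizing values are only
> > placeholders you would adapt to your setup:
> > 
> > ceph osd tier add rbd cache
> > ceph osd tier cache-mode cache writeback
> > ceph osd pool set cache hit_set_type bloom
> > ceph osd pool set cache hit_set_count 1
> > ceph osd pool set cache hit_set_period 3600
> > ceph osd pool set cache target_max_bytes 200000000000
> > ceph osd pool set cache cache_target_dirty_ratio 0.4
> > ceph osd pool set cache cache_target_full_ratio 0.8
> > ceph osd tier set-overlay rbd cache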
> > 
> > A basic question is whether you will have dedicated SSD cache tier
> > hosts or have the SSDs holding the cache pool in your current hosts.
> > Dedicated hosts have the advantage of matched HW, enough CPU power for
> > the SSDs, and simpler configuration; shared hosts can have the
> > advantage of spreading the network load further out instead of having
> > everything go through the cache tier nodes.
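> > 
> > Either way, the cache pool needs a CRUSH rule that only targets the
> > SSDs. A minimal sketch, assuming the cache SSD OSDs already sit under
> > a separate CRUSH root called "ssd":
> > 
> > ceph osd crush rule create-simple ssd-cache ssd host
> > ceph osd pool create cache 128 128 replicated ssd-cache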
> > 
> > The size and length of the explosion will entirely depend on:
> > 1) how capable your current cluster is, how (over)loaded it is.
> > 2) the actual load/usage at the time you phase the cache tier in
> > 3) the amount of "truly hot" objects you have.
> > 
> > As I wrote here:
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007933.html
> > 
> > In my case, with a BADLY overloaded base pool and a constant stream of
> > log/status writes (4-5 MB/s, 1000 IOPS) from 200 VMs, it all
> > stabilized after 10 minutes.
> > 
> > Truly hot objects, as mentioned above, will be those (in the case of
> > VM images) holding the inodes of active directories and files.
> > 
> > > Anything that needs to be considered? Dangerous no-no's?
> > > 
> > > Also, it will happen that I have to add the cache tier servers one
> > > by one, and not all at the same time.
> > 
> > You want at least 2 cache tier servers from the start, and well-known,
> > well-tested (LSI timeouts!) SSDs in them.
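> > 
> > A common way to vet journal/cache SSD candidates is a single-job
> > synchronous 4k write test with fio, e.g. something like the following
> > (destructive, so only run it against an empty/spare device):
> > 
> > fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
> >     --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based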
> > 
> > Christian
> > 
> > > I am happy for any kind of advice.
> > > 
> > > Thank you !

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
