The short answer is that uniform distribution is a lower-priority goal
of the CRUSH hashing algorithm.

CRUSH is designed to be consistent and stable in its hashing.  For the
details, you can read Sage's paper (
http://ceph.com/papers/weil-rados-pdsw07.pdf).  The goal is that if you
make a change to your cluster, there will be some moderate data movement,
but not everything moves.  If you then undo the change, things will go back
to exactly how they were before.

Doing that while also getting a uniform distribution is hard, and it's a
work in progress.  The CRUSH tunables are progress on this front, but they
are by no means the last word.


The current workaround is to use ceph osd reweight-by-utilization.  That
tool looks at the data distribution and reweights OSDs to bring them more
in line with each other.  Unfortunately, it does a ceph osd reweight, not a
ceph osd crush reweight.  (The existence of two different weights with
different behavior is unfortunate too.)  ceph osd reweight is temporary, in
that the value is lost if an OSD is marked out.  ceph osd crush reweight
updates the CRUSH map, and it is not temporary.  So I run ceph osd crush
reweight manually.
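
To make the distinction concrete (osd.30 here is just a placeholder; use
your own OSD ids), the commands look roughly like this:

    ceph osd reweight 30 0.95             # temporary override in [0,1]; lost when the OSD is marked out
    ceph osd crush reweight osd.30 3.95   # persistent CRUSH map weight, roughly the disk size in TiB
    ceph osd reweight-by-utilization      # automated version of the temporary kind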

Even if Ceph could automatically rebalance itself, I'd turn that feature
off.  Moving data around in my small cluster involves a major performance
hit.  By manually adjusting the crush weights, I have some control over
when and how much data is moved around.


I recommend taking a look at ceph osd tree and df on all nodes, then start
adjusting the crush weight of heavily used disks down and underutilized
disks up.  The crush weight is generally the size of the disk in TiB
(base 2).  I adjust my OSDs up or down by 0.05 each step, then decide
whether I need to make another pass.  I have one 4 TiB drive with a weight
of 4.14, and another with a weight of 3.04.  They're still not balanced,
but it's better.
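
A single pass then looks something like this (osd.30 and osd.87 are
hypothetical; pick the heavy and light OSDs from your own tree/df output):

    ceph osd tree                         # note the current crush weights
    ceph osd crush reweight osd.30 3.95   # overfull disk: step its weight down by 0.05
    ceph osd crush reweight osd.87 4.05   # underfull disk: step its weight up by 0.05
    ceph -s                               # wait for backfill to settle, then re-check df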


If data migration has a smaller impact on your cluster, larger steps should
be fine.  And if anything causes major problems, just revert the change.
CRUSH is stable and consistent :-)




On Mon, Jan 5, 2015 at 2:04 AM, ivan babrou <ibob...@gmail.com> wrote:

> Hi!
>
> I have a cluster with 106 osds and disk usage is varying from 166gb to
> 316gb. Disk usage is highly correlated to number of pg per osd (no surprise
> here). Is there a reason for ceph to allocate more pg on some nodes?
>
> The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest are 87,
> 33 and 55 (170gb each). The biggest pool has 2048 pgs, pools with very
> little data have only 8 pgs. PG size in the biggest pool is ~6gb (5.1..6.3
> actually).
>
> Lack of balanced disk usage prevents me from using all the disk space.
> When the biggest osd is full, cluster does not accept writes anymore.
>
> Here's gist with info about my cluster:
> https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
>
> --
> Regards, Ian Babrou
> http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
