Hi all, I'm adding a new OSD node with 36 OSDs to my cluster and have run into some problems. Here are some of the details of the cluster:
1 OSD node with 80 OSDs
1 EC pool with k=10, m=3
pg_num 1024
osd failure domain

I added a second OSD node and started creating OSDs with ceph-deploy, one by one. The first 2 added fine, but each subsequent new OSD resulted in more and more PGs stuck activating. I've added a total of 14 new OSDs, but had to set 12 of those to a weight of 0 to get the cluster healthy and usable until I get it fixed.

I have read some things about similar behavior due to PG overdose protection, but I don't think that's the case here because the failure domain is set to osd. Instead, I think my CRUSH rule needs some attention:

rule main-storage {
        id 1
        type erasure
        min_size 3
        max_size 13
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type osd
        step emit
}

I don't believe I have modified anything from the automatically generated rule except for the addition of the hdd class. I have been reading the documentation on CRUSH rules, but am having trouble figuring out whether the rule is set up properly. After a few more nodes are added I do want to change the failure domain to host, but osd is sufficient for now.

Can anyone help me figure out whether the rule is causing the problems, or whether I should be looking at something else?
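
In case it helps, this is how I was planning to sanity-check the rule mappings with crushtool (just a sketch; rule id 1 and 13 shards come from the k=10, m=3 pool above, and the file names are placeholders):

        # grab and decompile the current CRUSH map
        ceph osd getcrushmap -o crush.bin
        crushtool -d crush.bin -o crush.txt

        # test the EC rule for 13-wide mappings and report any that come up short
        crushtool -i crush.bin --test --rule 1 --num-rep 13 --show-bad-mappings

If --show-bad-mappings prints anything, I assume that means CRUSH can't find 13 OSDs for some PGs with the current rule and weights, but I'd appreciate confirmation that this is the right way to test it.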
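
And for completeness, these are the commands I was going to use to rule overdose protection in or out (the mon id below is a placeholder, and I believe the default mon_max_pg_per_osd is 200 on Luminous, please correct me if that's wrong):

        # which PGs are stuck and where they want to map
        ceph pg dump_stuck inactive

        # per-OSD PG counts (the PGS column) and CRUSH weights
        ceph osd df tree

        # the configured limit, queried on a mon host
        ceph daemon mon.<id> config get mon_max_pg_per_osd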