Hi all, I'm adding a new OSD node with 36 OSDs to my cluster and have run into some problems. Here are some of the details of the cluster:
1 OSD node with 80 OSDs
1 EC pool with k=10, m=3
pg_num 1024
osd failure domain

I added a second OSD node and started creating OSDs with ceph-deploy, one by one. The first 2 added fine, but each subsequent new OSD resulted in more and more PGs stuck activating. I've added a total of 14 new OSDs, but had to set 12 of those to a weight of 0 to get the cluster healthy and usable until I get it fixed.

I have read some things about similar behavior due to PG overdose protection, but I don't think that's the case here because the failure domain is set to osd. Instead, I think my CRUSH rule needs some attention:

rule main-storage {
        id 1
        type erasure
        min_size 3
        max_size 13
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type osd
        step emit
}

I don't believe I have modified anything from the automatically generated rule except for the addition of the hdd class. I have been reading the documentation on CRUSH rules, but am having trouble figuring out whether the rule is set up properly. After a few more nodes are added I do want to change the failure domain to host, but osd is sufficient for now.

Can anyone help me figure out whether the rule is causing the problems, or whether I should be looking at something else?
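
In case it helps, this is how I was planning to sanity-check the rule mappings with crushtool (just a sketch; rule id 1 and 13 shards come from the k=10, m=3 pool above, and the file names are placeholders):

        # grab and decompile the current CRUSH map
        ceph osd getcrushmap -o crush.bin
        crushtool -d crush.bin -o crush.txt

        # test the EC rule for 13-wide mappings and report any that come up short
        crushtool -i crush.bin --test --rule 1 --num-rep 13 --show-bad-mappings

If --show-bad-mappings prints anything, I assume that means CRUSH can't find 13 OSDs for some PGs with the current rule and weights, but I'd appreciate confirmation that this is the right way to test it.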
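
And for completeness, these are the commands I was going to use to rule overdose protection in or out (the mon id below is a placeholder, and I believe the default mon_max_pg_per_osd is 200 on Luminous, please correct me if that's wrong):

        # which PGs are stuck and where they want to map
        ceph pg dump_stuck inactive

        # per-OSD PG counts (the PGS column) and CRUSH weights
        ceph osd df tree

        # the configured limit, queried on a mon host
        ceph daemon mon.<id> config get mon_max_pg_per_osd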