Hello Alexander,

One other point on your email: you indicate you'd like each OSD to have ~100 PGs, but depending on your pool size, it seems you may have forgotten about the additional PGs associated with replication itself.
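If it helps to plug in your own numbers, here is a minimal Python sketch of the arithmetic that follows; the function name and arguments are my own illustration, not part of any Ceph tool:

    def pgs_per_osd(total_pgs, replica_count, osd_count):
        """Average number of PG copies each OSD ends up serving."""
        # Each PG is stored replica_count times, so the cluster holds
        # total_pgs * replica_count PG copies spread across osd_count OSDs.
        return total_pgs * replica_count / float(osd_count)

    print(pgs_per_osd(70000, 3, 800))   # -> 262.5

As you already noticed, the real per-OSD count also varies with CRUSH weight, so individual OSDs can sit well above this average.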
Assuming 3x replication in your environment:

    70,000 * 3
    ----------  =~ 262.5 PGs per OSD on average
     800 OSDs

While this PG-to-OSD ratio shouldn't cause significant pain, I would not go
to any higher PG count without adding more spindles.

For more specific PG count guidance and modeling, please see:
http://ceph.com/pgcalc

Hope this helps,

Michael J. Kidd
Sr. Storage Consultant
Red Hat Global Storage Consulting
+1 919-442-8878

On Wed, Sep 23, 2015 at 8:34 AM, Sage Weil <s...@newdream.net> wrote:
> On Wed, 23 Sep 2015, Alexander Yang wrote:
> > hello,
> >         We use Ceph + OpenStack in our private cloud. In our cluster we
> > have 5 mons and 800 OSDs, and the capacity is about 1 PB. We run about
> > 700 VMs and 1100 volumes.
> >         Recently we increased our pg_num; the cluster now has about 70000
> > PGs. My real intention was for every OSD to have 100 PGs, but after
> > increasing pg_num I found I was wrong: because different OSDs have
> > different CRUSH weights, their PG counts differ, and some OSDs now hold
> > more than 500 PGs.
> >         Now the problem appears: when I want to change an OSD's weight,
> > which means changing the crushmap, the change causes only about 0.03% of
> > the data to migrate, yet the mons keep starting elections. This hangs the
> > cluster, and when the elections end the original leader is still the
> > leader. During the mon elections, the VMs in the upper layer see too many
> > slow requests, so now I don't dare to do anything that changes the
> > crushmap. But I worry about an important thing: if our cluster loses one
> > host, or even one rack, the crushmap will change a lot and a lot of data
> > will migrate. I worry the cluster will hang for a long time and, as a
> > result, all the VMs in the upper layer will shut down.
> >         My guess is that when I change the crushmap, *the leader mon has
> > to calculate too much information*, or *too many clients want to get the
> > new crushmap from the leader mon*. This must hang the mon thread, so the
> > leader mon can't heartbeat to the other mons; the other mons think the
> > leader is down and begin a new election. I am sorry if my guess is wrong.
> >         The crushmap is attached. Can anyone give me some advice or
> > guidance? Thanks very much!
>
> There were huge improvements made in hammer in terms of mon efficiency in
> these cases where it is under load.  I recommend upgrading as that will
> help.
>
> You can also mitigate the problem somewhat by adjusting the mon_lease and
> associated settings up.  Scale all of mon_lease, mon_lease_renew_interval,
> mon_lease_ack_timeout, mon_accept_timeout by 2x or 3x.
>
> It also sounds like you may be using some older tunables/settings
> for your pools or crush rules.  Can you attach the output of 'ceph osd
> dump' and 'ceph osd crush dump | tail -n 20' ?
>
> sage
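As an aside on the lease tuning Sage suggests above: scaled 3x from what I
believe are the stock defaults (5s lease, 3s renew interval, 10s ack
timeout, 10s accept timeout -- please verify the current values on your own
monitors first), the [mon] section of ceph.conf would look something like
the sketch below; treat it as an illustration, not a recommendation tuned
for your exact cluster:

    [mon]
        mon lease = 15                # default ~5 seconds
        mon lease renew interval = 9  # default ~3 seconds
        mon lease ack timeout = 30    # default ~10 seconds
        mon accept timeout = 30       # default ~10 seconds

These are ordinary ceph.conf options, so the monitors need to be restarted
(one at a time, to keep quorum) for the new values to take effect.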
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com