Great summary David. Wouldn't this be worth a blog post?

On 17.05.2018 20:36, David Turner wrote:
> By sticking with a base 2 number of PGs (1024, 16384, etc.), all of
> your PGs will be the same size and easier to balance and manage. With
> a non-base-2 number you get something like this: say you have 4 PGs
> that are each 2GB in size. If you increase pg(p)_num to 6, you end up
> with 2 PGs of 2GB and 4 PGs of 1GB, because only 2 of the original
> PGs are split to reach the total of 6. If you increase pg(p)_num to
> 8, all 8 PGs will be 1GB. Depending on how you manage your cluster
> that may not matter, but for some methods of balancing a cluster it
> will greatly imbalance things.
>
> This would be a good time to go to a base 2 number. I think you're
> thinking of Gluster, where if you have 4 bricks and want to increase
> capacity, going to anything other than a multiple of 4 (8, 12, 16)
> kills performance (worse than increasing storage already does) and
> takes longer, because it has to awkwardly redistribute the data
> instead of splitting a single brick across multiple bricks.
>
> As you increase your PGs, do it slowly and in a loop. I like to
> increase my PGs by 256, wait for all PGs to create, activate, and
> peer, then rinse/repeat until I reach my target. [1] is an example of
> a script that should accomplish this without interference. Notice the
> use of flags while increasing the PGs: an OSD that OOMs or dies for
> any reason will make things take much longer by adding to the peering
> that needs to happen. It is also wasted IO to start backfilling while
> you're still making changes; it's best to wait until you have
> finished increasing your PGs and everything has peered before you let
> data start moving.
>
> Another thing to keep in mind is how long your cluster will be moving
> data around. Increasing the PG count on a pool full of data is one of
> the most intensive operations you can ask a cluster to do. The last
> time I had to do this, I increased pg(p)_num in steps of 4k PGs from
> 16k to 32k, let each step backfill, and rinsed/repeated until the
> desired PG count was reached. For me, each 4k step took 3-5 days
> depending on other cluster load and how full the cluster was. If you
> decide to increase your PGs in 4k steps instead of in one go, change
> the 16384 below to the number you decide to go to, let it backfill,
> and continue.
>
> [1]
> # Make sure to set the pool variable as well as the number ranges to
> # the appropriate values.
> flags="nodown nobackfill norecover"
> for flag in $flags; do
>     ceph osd set $flag
> done
> pool=rbd
> echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
> # The first number is the current PG count of the pool, the second is
> # the target PG count, and the third is how much to increase it by on
> # each pass through the loop. Note that brace expansion stops at the
> # last value <= the target, so check pg_num at the end and set it to
> # the exact target if the range does not land on it.
> for num in {7200..16384..256}; do
>     ceph osd pool set $pool pg_num $num
>     while sleep 10; do
>         ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>     done
>     ceph osd pool set $pool pgp_num $num
>     while sleep 10; do
>         ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>     done
> done
> for flag in $flags; do
>     ceph osd unset $flag
> done
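
One small addition from my side: the real data movement only starts
once the flags are unset, so it is worth keeping an eye on the cluster
until it has settled again. A rough, untested sketch of how that could
look (it assumes the cluster is otherwise healthy; the sleep interval
and the grep patterns are only examples):

    # Sketch only: poll until backfill/recovery has finished and the
    # cluster reports HEALTH_OK again, printing a short status line
    # in between.
    while ! ceph health | grep -q HEALTH_OK; do
        ceph -s | grep -E 'degraded|misplaced|backfill|recover'
        sleep 60
    done
    echo "cluster is back to HEALTH_OK"
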
> On Thu, May 17, 2018 at 9:27 AM Kai Wagner <kwag...@suse.com> wrote:
>
> > Hi Oliver,
> >
> > a good value is 100-150 PGs per OSD, so in your case between 20k
> > and 30k.
> >
> > You can increase your PGs, but keep in mind that this will keep
> > the cluster quite busy for a while. That said, I would rather
> > increase in smaller steps than in one large move.
> >
> > Kai
> >
> > On 17.05.2018 01:29, Oliver Schulz wrote:
> > > Dear all,
> > >
> > > we have a Ceph cluster that has slowly evolved over several
> > > years and Ceph versions (started with 18 OSDs and 54 TB in 2013,
> > > now about 200 OSDs and 1.5 PB, still the same cluster, with data
> > > continuity). So there are some "early sins" in the cluster
> > > configuration, left over from the early days.
> > >
> > > One of these sins is the number of PGs in our CephFS "data"
> > > pool, which is 7200 and therefore not (as recommended) a power
> > > of two. Pretty much all of our data is in the "data" pool; the
> > > only other pools are "rbd" and "metadata", and both contain
> > > little data (and they already have far too many PGs, another
> > > early sin).
> > >
> > > Is it possible - and safe - to change the number of "data" pool
> > > PGs from 7200 to 8192 or 16384? As we recently added more OSDs,
> > > I guess it would be time to increase the number of PGs anyhow.
> > > Or would we have to go to 14400 instead of 16384?
> > >
> > > Thanks for any advice,
> > >
> > > Oliver
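
And to Oliver's concrete question: 8192 is the next power of two above
7200, and 16384 the one after that. A tiny, untested sketch of how you
could check where a pool stands and what the next power of two would
be (the pool name "data" is just taken from your mail, and the awk
field assumes the usual "pg_num: N" output of ceph osd pool get):

    # Sketch only: print the current pg_num of a pool and the next
    # power of two at or above it.
    pool=data
    cur=$(ceph osd pool get $pool pg_num | awk '{print $2}')
    next=1
    while [ "$next" -lt "$cur" ]; do
        next=$(( next * 2 ))
    done
    echo "$pool: pg_num=$cur, next power of two is $next"

Whether you then stop at 8192 or go on to 16384 mostly depends on how
many PGs per OSD you want to end up with, as discussed above.
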
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)