Great summary, David. Wouldn't this be worth a blog post?

On 17.05.2018 20:36, David Turner wrote:
> By sticking with a power-of-two PG count (1024, 16384, etc.), all of
> your PGs will be the same size and therefore easier to balance and
> manage.  Here is what happens with a non-power-of-two count.  Say you
> have 4 PGs that are all 2GB in size.  If you increase pg(p)_num to 6,
> you end up with 2 PGs of 2GB and 4 PGs of 1GB, because 2 of the
> original PGs are split in half to reach the total of 6.  If you
> increase pg(p)_num to 8 instead, all 8 PGs will be 1GB.  Depending on
> how you manage your cluster that may not matter, but for some methods
> of balancing your cluster it will greatly imbalance things.
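>
> A quick check before picking a target (a minimal sketch, assuming a
> hypothetical pool named "data"; a count n is a power of two exactly
> when n & (n-1) == 0):
>
> pool=data
> pg_num=$(ceph osd pool get $pool pg_num | awk '{print $2}')
> if (( (pg_num & (pg_num - 1)) == 0 )); then
>   echo "$pool: pg_num=$pg_num is already a power of two"
> else
>   echo "$pool: pg_num=$pg_num is not a power of two"
> fi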
>
> This would be a good time to go to a power of two.  I think you're
> thinking of Gluster, where if you have 4 bricks and want to increase
> your capacity, going to anything other than a multiple of 4 (8, 12,
> 16) kills performance (even more than adding storage already does) and
> takes longer, because the data has to be redistributed awkwardly
> instead of simply splitting each brick up across multiple bricks.
>
> As you increase your PGs, do it slowly and in a loop.  I like to
> increase my PGs by 256 at a time, wait for all PGs to create, activate,
> and peer, and rinse/repeat until I reach my target.  [1] is an example
> of a script that should accomplish this without interference.  Notice
> the use of flags while increasing the PGs: without them, an OSD that
> OOMs or dies for any reason adds to the peering that has to happen and
> makes everything take much longer.  It is also wasted IO to start
> backfilling while you're still making changes; it's best to wait until
> you have finished increasing your PGs and everything has peered before
> you let data start moving.
>
> Another thing to keep in mind is how long your cluster will be moving
> data around.  Increasing the PG count on a pool full of data is one of
> the most intensive operations you can ask a cluster to perform.  The
> last time I had to do this, I increased pg(p)_num in steps of 4k PGs
> from 16k to 32k, let it backfill, and rinse/repeated until the desired
> PG count was reached.  For me, each 4k step took 3-5 days depending on
> other cluster load and how full the cluster was.  If you decide to
> increase your PGs by 4k at a time instead of doing the full increase
> at once, change the 16384 in the script to your next intermediate
> target, let it backfill (a short wait-loop sketch follows the script
> below), and continue.
>
>
> [1]
> # Make sure to set the pool variable as well as the number range to the
> # appropriate values.
> flags="nodown nobackfill norecover"
> for flag in $flags; do
>   ceph osd set $flag
> done
> pool=rbd
> echo "$pool currently has $(ceph osd pool get $pool pg_num) PGs"
> # The first number is your current PG count for the pool, the second
> # number is the target PG count, and the third number is how much to
> # increase it by each time through the loop.  If the step does not
> # divide the range evenly, finish with one final set to the exact target.
> for num in {7700..16384..256}; do
>   ceph osd pool set $pool pg_num $num
>   # wait until no PGs are peering, stale, activating, creating or inactive
>   while sleep 10; do
>     ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>   done
>   ceph osd pool set $pool pgp_num $num
>   while sleep 10; do
>     ceph health | grep -q 'peering\|stale\|activating\|creating\|inactive' || break
>   done
> done
> for flag in $flags; do
>   ceph osd unset $flag
> done
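>
> To "let it backfill" between those 4k rounds, a rough sketch of a wait
> loop (the health strings matched here are the usual backfill/recovery
> states; adjust them as needed for your version):
>
> while sleep 60; do
>   ceph health | grep -q 'backfill\|recover\|degraded\|misplaced\|peering' || break
> done
> echo "backfill finished, safe to start the next pg_num increase"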
>
> On Thu, May 17, 2018 at 9:27 AM Kai Wagner <kwag...@suse.com> wrote:
>
>     Hi Oliver,
>
>     a good value is 100-150 PGs per OSD, so in your case between 20k
>     and 30k.
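>
>     To see roughly where you sit today, a small read-only sketch (it
>     just sums pg_num * size over all pools and divides by the number
>     of OSDs; nothing on the cluster is changed):
>
>     total=0
>     for p in $(ceph osd pool ls); do
>       pg=$(ceph osd pool get $p pg_num | awk '{print $2}')
>       size=$(ceph osd pool get $p size | awk '{print $2}')
>       total=$(( total + pg * size ))
>     done
>     osds=$(ceph osd ls | wc -l)
>     echo "~$(( total / osds )) PG replicas per OSD"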
>
>     You can increase your PGs, but keep in mind that this will keep the
>     cluster quite busy for a while.  That said, I would rather increase
>     in smaller steps than in one large move.
>
>     Kai
>
>
>     On 17.05.2018 01:29, Oliver Schulz wrote:
>     > Dear all,
>     >
>     > we have a Ceph cluster that has slowly evolved over several
>     > years and Ceph versions (started with 18 OSDs and 54 TB
>     > in 2013, now about 200 OSDs and 1.5 PB, still the same
>     > cluster, with data continuity). So there are some
>     > "early sins" in the cluster configuration, left over from
>     > the early days.
>     >
>     > One of these sins is the number of PGs in our CephFS "data"
>     > pool, which is 7200 and therefore not (as recommended)
>     > a power of two. Pretty much all of our data is in the "data"
>     > pool; the only other pools are "rbd" and "metadata", both of
>     > which contain little data (and already have way too many PGs,
>     > another early sin).
>     >
>     > Is it possible - and safe - to change the number of "data"
>     > pool PGs from 7200 to 8192 or 16384? As we recently added
>     > more OSDs, I guess it would be time to increase the number
>     > of PGs anyhow. Or would we have to go to 14400 instead of
>     > 16384?
>     >
>     >
>     > Thanks for any advice,
>     >
>     > Oliver
>
>     -- 
>     SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham
>     Norton, HRB 21284 (AG Nürnberg)
>
>

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
