> On 02/07/2022 1:51 PM Maarten van Ingen <maarten.vanin...@surf.nl> wrote:
> One more thing -- how many PGs do you have per OSD right now for the nvme and 
> hdd roots?
> Can you share the output of `ceph osd df tree` ?
> 
> >> That is 1347 lines of text -- are you sure you want that? :-) In summary,
> >> for HDD we have between 7 and 55 PGs per OSD; OSD sizes range from 10 to 14TB.
> >> For NVMe it is between 30 and 60, all 1.4T; we run 4 OSDs per NVMe.

I see. A pastebin would make that readable. Feel free to share it privately if 
you prefer.
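Instead of pasting the full output, something like this can summarize the 
PGs-per-OSD counts. The PGS column position is an assumption here and may 
differ between Ceph releases, so check against one line of your own output 
first:

```shell
# Print "osd.N <pg-count>" per OSD; assumes PGS is the third-from-last
# column of `ceph osd df tree` (true on recent releases, where STATUS and
# NAME follow it) -- adjust $(NF-2) if your layout differs.
ceph osd df tree | awk '$NF ~ /^osd\./ {print $NF, $(NF-2)}'
```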

> Generally, the autoscaler is trying to increase your pools so that there are 
> roughly 100 PGs per OSD. This number is a good rule of thumb to balance 
> memory usage of the OSD and balancing of the data.
> However, if your cluster is already adequately balanced (with the upmap 
> balancer) then there might not be much use in splitting these pools further.
> 
> >> We still have a few really old pre-Luminous clients and thus cannot use 
> >> the upmap balancer, only the older modes. Balancing has been done by hand 
> >> before, but it's getting tedious at best, and therefore we want (need) to 
> >> use the auto-balancer as well. The idea was to increase PGs first and 
> >> auto-balance afterwards.

The other balancer modes are not very good. Is it really not an option to 
upgrade those old clients so you can enable the upmap balancer? It should do a 
good job even before you split the pools.
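Once the old clients are gone, enabling it is just a few commands. These are 
the standard balancer-module commands; the min-compat step will refuse while 
pre-Luminous clients are still connected:

```shell
# Check which feature releases your connected clients report first:
ceph features
# Then allow upmap (refuses if pre-Luminous clients are connected):
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status
```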

> That said -- some of your splits should go quite quickly, e.g. the nvme 256 
> -> 2048 having only 4GB of data.
> 
> >> As far as I know, we already did a few splits from 128 to 256 and this was 
> >> really fast. But is it safe to increase pg_num and pgp_num in one go for 
> >> these pools?

I think the CLI will only let you go 2x or maybe 4x in one step.
Set pg_num, then wait for all the new PGs to be created, then watch `ceph osd 
pool ls detail` -- it shows pg_num, pgp_num, pg_num_target and pgp_num_target, 
which together show the splitting progress.
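Concretely, a 2x step and the fields to watch would look like this 
("mypool" is a placeholder pool name):

```shell
# Double pg_num in one step, e.g. 256 -> 512:
ceph osd pool set mypool pg_num 512
# On Nautilus and later, pgp_num follows pg_num automatically; the
# pg_num_target/pgp_num_target fields show the goal until splitting finishes:
ceph osd pool ls detail | grep mypool
```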

> Some more gentle advice, if you do decide to go ahead with this, would be to 
> take the autoscaler's guidance but make the pg_num changes yourself. 
> (Splitting your pool holding 1128T of data will take several weeks -- you 
> probably want to make the changes gradually, not all at once).
> 
> >> That's the kind of advice we are looking for ;) Would this mean going to 
> >> 8k first and then 16k, or even intermediate steps? And how do we increase 
> >> pgp_num?

By gentle I mean a few hundred PGs at a time, or even just 10 at first, just to 
see how long the first few PG splits take.

pgp_num should follow automatically when you set pg_num (see the _target 
values I mentioned earlier).
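A sketch of that first gentle step -- "bigpool" and the numbers are 
placeholders for your pool's actual name and current pg_num:

```shell
# Nudge pg_num up by ~10 from its current value and see how long those
# splits take before committing to larger steps:
ceph osd pool set bigpool pg_num 4106
ceph -s                                  # watch backfill/recovery progress
ceph osd pool ls detail | grep bigpool   # compare pg_num vs pg_num_target
```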

> (You can sense I'm hesitating to recommend you just blindly enable the 
> autoscaler now that you have so much data in the cluster -- I fear it might 
> be disruptive for several weeks at best, and at worst you may hit that pg log 
> OOM bug).
> 
> >> But this bug would not hit with a manual increase?

No, a manual increase could still hit it.

That said, we split a huge pool from 4096 to 8192 sometime last year. It 
triggered a few bugs but no disasters. (It took a few weeks.)

-- dan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
