> On 02/07/2022 1:51 PM Maarten van Ingen <maarten.vanin...@surf.nl> wrote:
> One more thing -- how many PGs do you have per OSD right now for the nvme and
> hdd roots?
> Can you share the output of `ceph osd df tree` ?
>
> >> This is only 1347 lines of text, you sure you want that :-) On a summary,
> >> for HDD we have between 7 and 55 PGs, with OSD sizes ranging from 10 to
> >> 14TB. NVMe is between 30 and 60, all 1.4T in size; we run 4 OSDs per NVMe.
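The per-OSD PG count being discussed here can be estimated from a pool's pg_num, its replica count, and the number of OSDs under the relevant root. A minimal sketch of that arithmetic (the helper name and all numbers are hypothetical, not taken from this cluster):

```python
# Back-of-the-envelope PGs-per-OSD estimate (hypothetical helper, not a
# ceph CLI command): each PG in a replicated pool is stored on `size`
# OSDs, so on average an OSD holds pg_num * size / num_osds PGs of it.
def pgs_per_osd(pg_num: int, size: int, num_osds: int) -> float:
    return pg_num * size / num_osds

# Example: a 4096-PG pool with 3x replication over 400 OSDs
print(round(pgs_per_osd(4096, 3, 400), 1))  # → 30.7
```

`ceph osd df tree` reports the real per-OSD PG counts, which also reflect every pool mapped to that root rather than a single pool.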
I see; pastebin can make this readable. Share privately if you prefer.

> Generally, the autoscaler is trying to increase your pools so that there are
> roughly 100 PGs per OSD. This number is a good rule of thumb to balance
> memory usage of the OSD and balancing of the data.
> However, if your cluster is already adequately balanced (with the upmap
> balancer) then there might not be much use in splitting these pools further.
>
> >> We still have a few really old pre-Luminous clients and thus cannot use
> >> the upmap balancer, only the older one. Balancing has been done by hand
> >> before, but it's getting tedious at best, and therefore we want (need) to
> >> use the auto-balancer as well. The idea was to increase PGs first and
> >> auto-balance afterwards. The other balancers are not very good.

Is it really not an option to upgrade those old clients so you can enable the
upmap balancer? It should do a good job even before you split the pools.

> That said -- some of your splits should go quite quickly, e.g. the nvme 256
> -> 2048 having only 4GB of data.
>
> >> That I know; we already did a few splits from 128 to 256 and this was
> >> really fast. But is it safe to increase pg_num and pgp_num in one go for
> >> these pools?

I think the CLI will only let you x2 or maybe x4 in one go. Set pg_num, then
wait for them all to be created, then watch `ceph osd pool ls detail` -- it
will show the pg_num, pgp_num, pg_num_target and pgp_num_target, which
together show the splitting progress.

> Some more gentle advice, if you do decide to go ahead with this, would be to
> take the autoscaler's guidance and make the pg_num changes yourself.
> (Splitting your pool holding 1128T of data will take several weeks -- you
> probably want to make the changes gradually, not all at once).
>
> >> That's the kind of advice we are looking for ;) Would this mean going to
> >> 8k first and then 16k, or even intermediate steps? And pgp_num -- how do
> >> we increase that?
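The ~100 PGs per OSD rule of thumb quoted above can be turned into a rough pg_num suggestion like the sketch below (my own illustration with made-up numbers; the actual autoscaler logic in Ceph is more involved):

```python
import math

# Hedged sketch of the ~100-PGs-per-OSD rule of thumb: the OSDs under a
# root can carry about num_osds * 100 PG instances in total, and each
# replicated PG occupies `size` OSDs, so a single pool filling that root
# wants roughly num_osds * 100 / size PGs, rounded to a power of two
# (the conventional shape for pg_num).
def suggest_pg_num(num_osds: int, size: int, target_per_osd: int = 100) -> int:
    raw = num_osds * target_per_osd / size
    return 2 ** round(math.log2(raw))

# Example: 400 OSDs under one root, 3x replication
print(suggest_pg_num(400, 3))  # → 16384
```

With several pools sharing a root, the budget would be divided between them in proportion to their expected data, which is essentially what the autoscaler's target ratios do.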
By gentle I mean a few hundred at a time, or even just 10 at first, just to
see how long the first few PG splits take. pgp_num should move automatically
if you set pg_num. (See the _target values I mentioned earlier.)

> (You can sense I'm hesitating to recommend you just blindly enable the
> autoscaler now that you have so much data in the cluster -- I fear it might
> be disruptive for several weeks at best, and at worst you may hit that pg
> log OOM bug).
>
> >> But this bug would not hit with a manual increase?

No, it could still hit. But we split a huge pool from 4096 to 8192 sometime
last year. It triggered a few bugs but no disasters. (It took a few weeks.)

-- dan

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io