> Just to check, are you recommending that at some point each week all PGs are
> clean *at the same time*, or that no PGs should be unclean for more than a
> week?

The former, I think, so that the cluster is converged, which in turn enables
the mons to cull old maps and compact their DBs.

> The latter absolutely makes sense, but the former can be quite hard to
> manage sometimes. On this cluster, with about one drive failure a week,
> we're somewhat at the mercy of probability. We do always try and aim for
> 'clean-ish' every so often though :)

I think Dan’s suggestion of upmap-remapped is intended to address that: it
lets one (temporarily) convince the cluster that there are no PGs / RADOS
objects remapped / backfilling / recovering, so that the old maps can be
released and everything compacted.

> Also, just to double check my understanding here, the cluster needs to keep
> hold of osdmaps going back to the point at which the currently unclean PGs
> were last clean?

At PG granularity, I think so.

> So if a cluster has a bunch of backfill being queued continuously for a
> month, but individual PGs get remapped and then backfilled quickly (e.g.
> ~1 day), the cluster only needs to hold onto maps for the day, rather than
> the entire month? Or am I missing something?

I suspect that the actual experience may be different, that PG convergence
isn’t strictly a FIFO.

> The above is how I would imagine an even larger cluster would operate, with
> the expectation that there will always be at least one non-clean PG at any
> time.

Dan made that important observation like ten years ago at an OpenStack
Summit ;)

> As long as PGs that are not clean will 'quickly' become clean, the range of
> maps needing to be kept around will be fairly small and the cluster could
> carry on in this state indefinitely.

That’s the thing: I wouldn’t assume that the range is a small, sliding
window. Successive changes to topology and state may kick a given PG’s
convergence back to the end of the line, so to speak. Speculation on my part.

> Thanks for your various recommendations, there are definitely a few things
> we don't do that we should (e.g. a balancer schedule).
>
> We don't make use of upmap-remapped for normal operations currently, but I
> think what you're proposing here makes a lot of sense, especially combined
> with a balancer schedule. One of the issues I noted with this approach on
> this cluster is the inevitability of degraded PGs due to an unrelated failed
> drive/host stopping[1] the movement of data onto new disks/hosts/generations.
> This causes us issues in planning big data moves, although it is something
> we could easily tweak.

This is one of the nuances that drives me to counsel against HDDs: time to
recover, and thus increased risk of overlapping failures. R4 or EC with a
relatively high value of m can guard against overlapping failures, but will
themselves increase MTTR.

> Finally, thanks for the hint about how to identify how many maps are being
> kept. Being able to track this is really handy, and takes a lot of the
> guesswork out of understanding the need to take breaks in cluster
> operations. I think we also need to pay more attention to 'unclean
> durations' of individual PGs, which is something we can do.

Dan is a treasure.
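
On tracking the osdmap backlog and per-PG 'unclean durations', something
along these lines might help -- a rough, untested sketch. It assumes jq is
installed, and the JSON paths/field names (pg_map, pg_stats, last_clean) can
vary a bit between releases, so check against your version first:

# How many osdmap epochs are the mons currently holding?
ceph report | jq '.osdmap_last_committed - .osdmap_first_committed'

# PGs that have been stuck unclean for more than a day (threshold in seconds).
ceph pg dump_stuck unclean 86400

# Rough per-PG "unclean duration": list non-clean PGs with their state,
# last state change, and the last time they were clean.
ceph pg dump --format json-pretty | jq -r '
  .pg_map.pg_stats[]
  | select(.state | contains("clean") | not)
  | [.pgid, .state, .last_change, .last_clean]
  | @tsv'

Watching the first number over time is probably the easiest way to take the
guesswork out of when the cluster needs a convergence break.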
>
> Cheers,
> Tom
>
> [1] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/balancer/module.py#L1040
>
> ________________________________________
> From: Dan van der Ster <dan.vanders...@clyso.com>
> Sent: Tuesday, January 7, 2025 21:15
> To: Byrne, Thomas (STFC,RAL,SC) <tom.by...@stfc.ac.uk>
> Cc: ceph-users@ceph.io <ceph-users@ceph.io>
> Subject: Re: [ceph-users] Slow initial boot of OSDs in large cluster with unclean state
>
> Hi Tom,
>
> On Tue, Jan 7, 2025 at 10:15 AM Thomas Byrne - STFC UKRI <tom.by...@stfc.ac.uk> wrote:
>> I realise the obvious answer here is don't leave big cluster in an unclean
>> state for this long. Currently we've got PGs that have been remapped for 5
>> days, which matches the 30,000 OSDMap epoch range perfectly. This is
>> something we're always looking at from a procedure point of view e.g.
>> keeping max_backfills as high as possible by default, ensuring balancer
>> max_misplaced is appropriate, re-evaluating disk and node addition/removal
>> processes. But the reality on this cluster is that sometimes these
>> 'logjams' happen, and it would be good to understand if we can improve the
>> OSD addition experience so we can continue to be flexible with our
>> operation scheduling.
>
> I find it's always best to aim to have all PGs clean at least once a
> week -- that way the osdmaps can be trimmed at least weekly,
> preventing all sorts of nastiness, one of which you mentioned here.
>
> Here's my recommended mgr balancer tuning:
>
> # Balance PGs Sunday to Friday, letting the backfilling finish on
> # Saturdays. (adjust the exact days if needed -- the goal here is that
> # at some point in the week, there needs to be 0 misplaced and 0
> # degraded objects.)
> ceph config set mgr mgr/balancer/begin_weekday 0
> ceph config set mgr mgr/balancer/end_weekday 5
>
> # [Alternatively] Balance PGs during working hours, letting the
> # backfilling finish over night:
> ceph config set mgr mgr/balancer/begin_time 0830
> ceph config set mgr mgr/balancer/end_time 1800
>
> # Decrease the max misplaced from the default 5% to 0.5%, to minimize
> # the impact of backfilling and ensure the tail of backfilling PGs can
> # finish over the weekend or over night -- increase this percentage if
> # your cluster can tolerate it. (IMHO 5% is way too many misplaced
> # objects on large clusters, but this is very use-case-specific).
> ceph config set mgr target_max_misplaced_ratio 0.005
>
> # Configure the balancer to aim for +/- 1 PG per pool per OSD -- this
> # is the best uniformity we can hope for with the mgr balancer
> ceph config set mgr mgr/balancer/upmap_max_deviation 1
>
> Then whenever you add/remove hardware, here's my recommended procedure:
>
> 1. Set some flags to prevent data from moving immediately when we add new OSDs:
>
> ceph osd set norebalance
> ceph balancer off
>
> 2. Add the new OSDs. (Or start draining -- but note that if you are
> draining OSDs, set the crush weights to 0.1 instead of 0.0 -- upmap
> magic tools don't work with OSDs having crush weight = 0).
>
> 3. Run ./upmap-remapped.py [1] until the number of misplaced objects
> is as close as possible to zero.
>
> 4. Then unset the flags so data starts rebalancing again. I.e. the mgr
> balancer will move data in a controlled manner to those new empty
> OSDs:
>
> ceph osd unset norebalance
> ceph balancer on
>
> I have a couple of talks with more on this topic:
> - https://www.youtube.com/watch?v=6PQYHlerJ8k
> - https://www.youtube.com/watch?v=A4xG975UWts
>
> We also have a plan to get this logic directly into ceph:
> https://tracker.ceph.com/issues/67418
>
> As to what you can do right now -- it's actually a great time to test
> out the above approach. Here's exactly what I'd do:
>
> 1. Stop those new OSDs (the ones that are not "in" yet) -- no point
> having them pull in 30000 osdmaps. Nothing should be degraded at this
> point -- if so, you either stopped too many OSDs, or there was some
> OSD flap that you need to recover from.
>
> 2. Since you have several remapped PGs right now, that's a perfect
> time to use upmap-remapped.py [1] -- it'll make the remapped PGs clean
> again. So try running it:
>
> ceph balancer off            # disable the mgr balancer, otherwise it would "undo" what we do next
> ./upmap-remapped.py          # this just outputs commands directly to stdout.
> ./upmap-remapped.py | sh -x  # this will run those commands.
> ./upmap-remapped.py | sh -x  # run it again -- normally we need to run it just twice to get to a minimal number of misplaced PGs.
>
> 3. When you run it, you should see the % misplaced objects decreasing.
> Ideally it will go to 0, meaning all PGs are active+clean. At that
> point the OSDmaps should trim.
>
> 4. Confirm that osdmaps have trimmed by looking at the `ceph report`:
>
> ceph report | jq '(.osdmap_last_committed - .osdmap_first_committed)'
>
> ^^ the number above should be less than 750. If not, the osdmaps are
> not trimmed, and you need to investigate further.
>
> 5. Now start those new OSDs; they should pull in the ~750 osdmaps
> quickly. Then do the upmap-remapped procedure after configuring the
> balancer as I described.
>
> Hope this all helps, Happy New Year Tom.
>
> Cheers, Dan
>
> [1] https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
>
> --
> Dan van der Ster
> CTO @ CLYSO
> Try our Ceph Analyzer -- https://analyzer.clyso.com/
> https://clyso.com | dan.vanders...@clyso.com
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
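
P.S. For anyone who wants to script Dan's add-hardware procedure end to end,
here's a rough, untested sketch. It assumes upmap-remapped.py from the
cernceph/ceph-scripts repo linked above is in the working directory and that
jq is installed; treat it as an outline to adapt, not a drop-in tool.

#!/bin/bash
# Rough outline of the add-hardware workflow described above. Untested sketch.
set -euo pipefail

# 1. Freeze data movement before the new OSDs come in.
ceph osd set norebalance
ceph balancer off

# 2. Add the new OSDs (or set crush weights for draining) out of band,
#    then continue.
read -r -p "Add/weight the new OSDs now, then press enter to continue... "

# 3. Run upmap-remapped twice to pin PGs to their current OSDs and drive
#    the misplaced count as close to zero as possible.
./upmap-remapped.py | sh -x
./upmap-remapped.py | sh -x

# 4. Once everything is active+clean the mons should trim; the epoch
#    range reported here should drop to a few hundred.
ceph report | jq '.osdmap_last_committed - .osdmap_first_committed'

# 5. Hand control back to the mgr balancer so data moves gradually.
ceph osd unset norebalance
ceph balancer on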