> Just to check, are you recommending that at some point each week all PGs are 
> clean *at the same time*, or that no PGs should be unclean for more than a 
> week?

The former, I think, so that the cluster is converged, which in turn enables the 
mons to cull old maps and compact their DBs.
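
For reference, a rough way to keep an eye on that (the osdmap range check is 
lifted from Dan's mail below; the mon store path is an assumption based on a 
conventional, non-containerized layout):

  # range of osdmaps the mons are still holding -- under ~750 once trimmed
  ceph report | jq '.osdmap_last_committed - .osdmap_first_committed'

  # approximate size of a mon's store (adjust the path for your deployment)
  du -sh /var/lib/ceph/mon/*/store.db

  # once maps have trimmed, a mon can be asked to compact its store
  ceph tell mon.<id> compact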

> The latter absolutely makes sense, but the former can be quite hard to manage 
> sometimes. On this cluster, with about one drive failure a week, we're 
> somewhat at the mercy of probability. We do always try and aim for 'clean-ish' 
> every so often though :)

I think Dan’s suggestion of upmap-remapped is intended to address that: it lets 
one (temporarily) convince the cluster that there are no PGs / RADOS objects 
remapped / backfilling / recovering, so that the old maps can be released and 
everything compacted.
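
As I understand it, under the hood that's just the pg-upmap-items mechanism: for 
each remapped PG the script pins the PG to the OSDs it currently occupies, so up 
matches acting, the PG reports clean, and the balancer can later remove the pins 
gradually. A hand-rolled sketch for a single PG (pgid and OSD ids invented for 
illustration; in practice the script computes these for every remapped PG):

  # CRUSH now wants osd.42 for pg 12.7f, but the data is still on osd.17;
  # mapping 42 -> 17 makes up == acting, so the PG is no longer remapped
  ceph osd pg-upmap-items 12.7f 42 17

  # removing the exception later lets the data move under balancer control
  ceph osd rm-pg-upmap-items 12.7f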


> Also, just to double check my understanding here, the cluster needs to keep 
> hold of osdmaps going back to the point at which the currently unclean PGs 
> were last clean?

At PG granularity, I think so.
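
If one wanted to check that empirically (hedging on the exact JSON field names, 
which I believe come straight from pg_stat_t), the oldest per-PG clean epoch 
should track the oldest osdmap the mons keep:

  # oldest epoch at which any PG was last clean
  ceph pg dump -f json | jq '[.pg_map.pg_stats[].last_epoch_clean] | min'

  # oldest osdmap still held by the mons -- should be close to the value above
  ceph report | jq '.osdmap_first_committed'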

> So if a cluster has a bunch of backfill being queued continuously for a 
> month, but individual PGs get remapped and then backfilled quickly (e.g. 
> ~1day), the cluster only needs to hold onto maps for the day, rather than the 
> entire month period? Or am I missing something?

I suspect that the actual experience may be different: PG convergence isn't 
strictly a FIFO.

> The above is how I would imagine an even larger cluster would operate, with 
> the expectation that there will always be at least one non-clean PG at any 
> time

Dan made that important observation like ten years ago at an OpenStack Summit ;)

> As long as PGs that are not clean will 'quickly' become clean, the range of 
> maps needing to be kept around will be fairly small and the cluster could 
> carry on in this state indefinitely.

That’s the thing: I wouldn’t assume that the range is a small, sliding window. 
Successive changes to topology and state may kick a given PG’s convergence back 
to the end of the line, so to speak. (Speculation on my part.)

> Thanks for your various recommendations, there are definitely a few things we 
> don't do that we should (e.g. a balancer schedule).
> 
> We don't make use of upmap-remapped for normal operations currently, but I 
> think what you're proposing here makes a lot of sense, especially combined 
> with a balancer schedule. One of the issues I noted with this approach on this 
> cluster is the inevitability of degraded PGs due to an unrelated failed 
> drive/host stopping[1] the movement of data onto new disks/hosts/generations. 
> This causes us issues in planning big data moves, although that is something 
> we could easily tweak.

This is one of the nuances that drives me to counsel against HDDs: longer time 
to recover, and thus increased risk of overlapping failures. R4 (4x replication) 
or EC with a relatively high value of m can guard against overlapping failures, 
but they in turn increase MTTR.
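
For what it's worth, by "relatively high value of m" I mean something along 
these lines (profile name and k/m values purely illustrative):

  # tolerates three overlapping failures per PG; the wider stripe means more
  # devices are involved in each recovery, which is the MTTR trade-off above
  ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
  ceph osd erasure-code-profile get ec-8-3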

> 
> Finally, thanks for the hint about how to identify how many maps are being 
> kept. Being able to track this is really handy, and takes a lot of the 
> guesswork out of understanding the need to take breaks in cluster operations. 
> I think we also need to pay more attention to 'unclean durations' of 
> individual PGs, which is something we can do.

Dan is a treasure.
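
On tracking per-PG "unclean durations", one blunt option is the stuck-PG query 
with an explicit threshold (86400 seconds here is just an example):

  # PGs that have been stuck unclean for more than a day
  ceph pg dump_stuck unclean 86400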

> 
> Cheers,
> Tom
> 
> [1] 
> https://github.com/ceph/ceph/blob/main/src/pybind/mgr/balancer/module.py#L1040
> ________________________________________
> From: Dan van der Ster <dan.vanders...@clyso.com>
> Sent: Tuesday, January 7, 2025 21:15
> To: Byrne, Thomas (STFC,RAL,SC) <tom.by...@stfc.ac.uk>
> Cc: ceph-users@ceph.io <ceph-users@ceph.io>
> Subject: Re: [ceph-users] Slow initial boot of OSDs in large cluster with 
> unclean state
>  
> Hi Tom,
> 
> On Tue, Jan 7, 2025 at 10:15 AM Thomas Byrne - STFC UKRI
> <tom.by...@stfc.ac.uk> wrote:
>> I realise the obvious answer here is don't leave a big cluster in an unclean 
>> state for this long. Currently we've got PGs that have been remapped for 5 
>> days, which matches the 30,000 OSDMap epoch range perfectly. This is 
>> something we're always looking at from a procedure point of view e.g. 
>> keeping max_backfills as high as possible by default, ensuring balancer 
>> max_misplaced is appropriate, re-evaluating disk and node addition/removal 
>> processes. But the reality on this cluster is that sometimes these 'logjams' 
>> happen, and it would be good to understand if we can improve the OSD 
>> addition experience so we can continue to be flexible with our operation 
>> scheduling.
> 
> I find it's always best to aim to have all PGs clean at least once a
> week -- that way the osdmaps can be trimmed at least weekly,
> preventing all sorts of nastiness, one of which you mentioned here.
> 
> Here's my recommended mgr balancer tuning:
> 
> # Balance PGs Sunday to Friday, letting the backfilling finish on
> # Saturdays. (adjust the exact days if needed -- the goal here is that
> # at some point in the week, there needs to be 0 misplaced and 0
> # degraded objects.)
> ceph config set mgr mgr/balancer/begin_weekday 0
> ceph config set mgr mgr/balancer/end_weekday 5
> 
> # [Alternatively] Balance PGs during working hours, letting the
> # backfilling finish overnight:
> ceph config set mgr mgr/balancer/begin_time 0830
> ceph config set mgr mgr/balancer/end_time 1800
> 
> # Decrease the max misplaced from the default 5% to 0.5%, to minimize
> # the impact of backfilling and ensure the tail of backfilling PGs can
> # finish over the weekend or overnight -- increase this percentage if
> # your cluster can tolerate it. (IMHO 5% is way too many misplaced
> # objects on large clusters, but this is very use-case-specific).
> ceph config set mgr target_max_misplaced_ratio 0.005
> 
> # Configure the balancer to aim for +/- 1 PG per pool per OSD -- this
> # is the best uniformity we can hope for with the mgr balancer
> ceph config set mgr mgr/balancer/upmap_max_deviation 1
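
(Aside from me: once these are set, it's easy to confirm what the balancer
actually picked up with the two standard commands below.)

   ceph balancer status
   ceph config dump | grep -E 'balancer|target_max_misplaced_ratio'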
> 
> Then whenever you add/remove hardware, here's my recommended procedure:
> 
> 1. Set some flags to prevent data from moving immediately when we add new 
> OSDs:
>    ceph osd set norebalance
>    ceph balancer off
> 
> 2. Add the new OSDs. (Or start draining -- but note that if you are
> draining OSDs, set the crush weights to 0.1 instead of 0.0 -- upmap
> magic tools don't work with OSDs having crush weight = 0).
> 
> 3. Run ./upmap-remapped.py [1] until the number of misplaced objects
> is as close as possible to zero.
> 
> 4. Then unset the flags so data starts rebalancing again. I.e. the mgr
> balancer will move data in a controlled manner to those new empty
> OSDs:
> 
>   ceph osd unset norebalance
>   ceph balancer on
> 
> I have a couple talks about this for more on this topic:
>  - https://www.youtube.com/watch?v=6PQYHlerJ8k
>  - https://www.youtube.com/watch?v=A4xG975UWts
> 
> We also have a plan to get this logic directly into ceph:
> https://tracker.ceph.com/issues/67418
> 
> As to what you can do right now -- it's actually a great time to test
> out the above approach. Here's exactly what I'd do:
> 
> 1. Stop those new OSDs (the ones that are not "in" yet) -- no point
> having them pull in 30000 osdmaps. Nothing should be degraded at this
> point -- if so, you either stopped too many OSDs, or there was some
> OSD flap that you need to recover from.
> 
> 2. Since you have several remapped PGs right now, that's a perfect
> time to use upmap-remapped.py [1] -- it'll make the remapped PGs clean
> again.  So try running it:
> 
>   ceph balancer off            # disable the mgr balancer, otherwise it
>                                # would "undo" what we do next
>   ./upmap-remapped.py          # this just outputs commands directly to stdout.
>   ./upmap-remapped.py | sh -x  # this will run those commands.
>   ./upmap-remapped.py | sh -x  # run it again -- normally we need to just run
>                                # it twice to get to a minimal number of misplaced PGs.
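
(Another aside from me: the exceptions the script creates show up as
pg_upmap_items entries in the osdmap, so something like the line below gives a
feel for how much "pinning" the balancer will have to unwind later.)

   ceph osd dump | grep -c pg_upmap_items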
> 
> 3. When you run it, you should see the % misplaced objects decreasing.
> Ideally it will go to 0, meaning all PGs are active+clean. At that
> point the OSDmaps should trim.
> 
> 4. Confirm that osdmaps have trimmed by looking at the `ceph report`:
> 
>   ceph report | jq '(.osdmap_last_committed - .osdmap_first_committed)'
> 
> ^^ the number above should be less than 750. If not -- then the
> osdmaps are not trimmed, and you need to investigate further.
> 
> 5. Now start those new OSDs, they should pull in the ~750 osdmaps
> quickly, and then do the upmap-remapped procedure after configuring
> the balancer as I described.
> 
> Hope this all helps, Happy New Year Tom.
> 
> Cheers, Dan
> 
> [1] 
> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
> 
> --
> Dan van der Ster
> CTO @ CLYSO
> Try our Ceph Analyzer -- https://analyzer.clyso.com/
> https://clyso.com | dan.vanders...@clyso.com
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io