Right, it would freeze the PGs in place at the time upmap-remapped is run. You need to keep running the upmap balancer afterwards to restore the optimized state.
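For what it's worth, the whole workflow is roughly the following. This is only a rough sketch (the assumption that upmap-remapped.py prints plain "ceph osd pg-upmap-items ..." / "ceph osd rm-pg-upmap-items ..." commands to stdout, suitable for piping to a shell, is based on the CERN ceph-scripts repo):

#!/usr/bin/env python3
# Rough sketch, not a drop-in tool.
import subprocess

# 1. Apply upmap entries that pin every currently-remapped PG to the OSDs
#    it already sits on, so those PGs go active+clean immediately.
#    (upmap-remapped.py is assumed to print ceph CLI commands to stdout.)
subprocess.run("./upmap-remapped.py | sh", shell=True, check=True)

# 2. Make sure the upmap balancer is enabled so it can gradually move the
#    cluster back to the optimized layout afterwards.
subprocess.run(["ceph", "balancer", "mode", "upmap"], check=True)
subprocess.run(["ceph", "balancer", "on"], check=True)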
I don't quite understand your question about a failed / replaced osd, but yes, it is relevant here.

Suppose you have osds 0, 1, 2, and 3, and osd.1 fails: a hypothetical pg_upmap_items entry which mapped 0 to 1 *and* 2 to 3 would be removed when osd.1 is marked out. This would result in that PG being remapped and data moved from 3 to 2. [1]

So if you run upmap-remapped just afterwards, it would create a new pg_upmap_items entry mapping 2 to 3, making that PG active+clean again immediately. And then later, when you recreate osd.1, crush would recalculate, and after some iterations of the upmap balancer the original pg_upmap_items entry would be created again.

-- Dan

[1] This hints at an optimization for the "clean upmaps" functionality in OSDMap.cc -- if an osd is marked out, it could modify the relevant pg_upmap_items accordingly rather than remove them completely. (A rough sketch of the idea is at the end of this mail.)

On Sun, May 3, 2020 at 10:27 PM Anthony D'Atri <a...@dreamsnake.net> wrote:
>
> Do I misunderstand this script, or does it not _quite_ do what’s desired here?
>
> I fully get the scenario of applying a full-cluster map to allow incremental topology changes.
>
> To be clear, if this is run to effectively freeze backfill during / following a traumatic event, it will freeze that adapted state, not strictly return to the pre-event state? And thus the pg-upmap balancer would still need to be run to revert to the prior state? And this would also hold true for a failed/replaced OSD?
>
> > On May 1, 2020, at 7:37 AM, Dylan McCulloch <d...@unimelb.edu.au> wrote:
> >
> > Thanks Dan, that looks like a really neat method & script for a few use-cases. We've actually used several of the scripts in that repo over the years, so many thanks for sharing.
> >
> > That method will definitely help in the scenario in which a set of unnecessary pg remaps has been triggered and can be caught early and reverted. I'm still a little concerned about the possibility of, for example, a brief network glitch occurring at night and then waking up to a full, unbalanced cluster, especially with NVMe clusters that can rapidly remap and rebalance (and for which we also have a greater impetus to squeeze out as much available capacity as possible with upmap, due to cost per TB). It's just a risk I hadn't previously considered, and I was wondering if others have either run into it or felt any need to plan around it.
> >
> > Cheers,
> > Dylan
> >
> >> From: Dan van der Ster <d...@vanderster.com>
> >> Sent: Friday, 1 May 2020 5:53 PM
> >> To: Dylan McCulloch <d...@unimelb.edu.au>
> >> Cc: ceph-users <ceph-users@ceph.io>
> >> Subject: Re: [ceph-users] upmap balancer and consequences of osds briefly marked out
> >>
> >> Hi,
> >>
> >> You're correct that all the relevant upmap entries are removed when an OSD is marked out.
> >> You can try to use this script, which will recreate them and get the cluster back to HEALTH_OK quickly:
> >> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
> >>
> >> Cheers, Dan
> >>
> >> On Fri, May 1, 2020 at 9:36 AM Dylan McCulloch <d...@unimelb.edu.au> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> We're using the upmap balancer, which has made a huge improvement in evenly distributing data on our osds and has provided a substantial increase in usable capacity.
> >>>
> >>> Currently on ceph version: 12.2.13 luminous
> >>>
> >>> We ran into a firewall issue recently which led to a large number of osds being briefly marked 'down' & 'out'.
> >>> The osds came back 'up' & 'in' after about 25 mins and the cluster was fine, but it had to perform a significant amount of backfilling/recovery despite there being no end-user client I/O during that period.
> >>>
> >>> Presumably the large number of remapped pgs and backfills were due to pg_upmap_items being removed from the osdmap when osds were marked out, with those pgs subsequently redistributed using the default crush algorithm.
> >>> As a result of the brief outage our cluster became significantly imbalanced again, with several osds very close to full.
> >>> Is there any reasonable mitigation for that scenario?
> >>>
> >>> The auto-balancer will not perform optimizations while there are degraded pgs, so it would only start reapplying pg upmap exceptions after initial recovery is complete (at which point capacity may be dangerously reduced).
> >>> Similarly, as admins, we normally only apply changes when the cluster is in a healthy state, but if the same issue were to occur again, would it be advisable to manually apply balancer plans while initial recovery is still taking place?
> >>>
> >>> I guess my concern from this experience is that making use of the capacity gained by the upmap balancer appears to carry some risk, i.e. it's possible for a brief outage to remove those space efficiencies relatively quickly and potentially result in full osds/cluster before the automatic balancer is able to resume and redistribute pgs using upmap.
> >>>
> >>> Curious whether others have any thoughts or experience regarding this.
> >>>
> >>> Cheers,
> >>> Dylan
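P.S. Re [1]: a very rough sketch of the idea, as Python pseudocode rather than the actual OSDMap.cc logic -- when an osd is marked out, drop only the mapping pairs that reference that osd, instead of removing the whole pg_upmap_items entry:

def prune_pg_upmap_items(pg_upmap_items, out_osd):
    # Keep the (source, target) pairs that don't involve the out osd,
    # rather than removing the PG's whole entry.
    pruned = {}
    for pgid, pairs in pg_upmap_items.items():
        kept = [(src, dst) for (src, dst) in pairs
                if out_osd not in (src, dst)]
        if kept:
            pruned[pgid] = kept
    return pruned

# With the example above (pairs 0->1 and 2->3, osd.1 marked out; the pg id
# here is made up): today the whole entry is removed, whereas this sketch
# keeps the 2->3 pair, so no data would move from 3 back to 2.
print(prune_pg_upmap_items({"1.0": [(0, 1), (2, 3)]}, out_osd=1))
# -> {'1.0': [(2, 3)]}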