The latest version of upmap-remapped (since September) switched to using the Python rados bindings, which not only fixes this problem but also makes it much faster. It also has a fix I made that orders the upmaps so that data is moved off of OSDs before data is moved onto them. This helps a lot on clusters with EC pools.
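To illustrate the ordering idea, here is a toy sketch (not the actual script; the UpmapOp structure and order_ops helper are made up for this example): any op that drains an OSD which some other op will later fill gets scheduled first.

    from collections import namedtuple

    # Hypothetical structure for illustration: each planned upmap records which
    # OSDs a PG's data would leave and which OSDs it would land on.
    UpmapOp = namedtuple("UpmapOp", ["pgid", "from_osds", "to_osds"])

    def order_ops(ops):
        """Apply upmaps that drain an OSD before upmaps that fill that same OSD."""
        receiving = {osd for op in ops for osd in op.to_osds}
        # Ops whose source OSDs are destinations of other ops free up space that
        # those later ops will need, so schedule them first.
        return sorted(
            ops,
            key=lambda op: 0 if any(o in receiving for o in op.from_osds) else 1,
        )

    ops = [
        UpmapOp("1.a", from_osds=[3], to_osds=[7]),
        UpmapOp("1.b", from_osds=[7], to_osds=[9]),  # drains osd.7, so it runs first
    ]
    for op in order_ops(ops):
        pairs = " ".join(f"{src} {dst}" for src, dst in zip(op.from_osds, op.to_osds))
        # The real script talks to the cluster via the rados bindings; the
        # equivalent CLI command is printed here only to show the resulting order.
        print(f"ceph osd pg-upmap-items {op.pgid} {pairs}")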
Bryan

From: Alexander Patrakov <patra...@gmail.com>
Date: Friday, January 17, 2025 at 09:53
To: Anthony D'Atri <anthony.da...@gmail.com>
Cc: Kasper Rasmussen <kasper_steenga...@hotmail.com>, ceph-users@ceph.io <ceph-users@ceph.io>
Subject: [ceph-users] Re: Adding Rack to crushmap - Rebalancing multiple PB of data - advice/experience

Hello Kasper,

Please be aware that the current "upmap-remapped" script is flaky. It might
just refuse to work, with this message:

Error loading remapped pgs

This has been traced to the fact that "ceph pg ls remapped -f json" sets its
stderr to non-blocking mode, and that is the same file descriptor to which jq
(which follows in the pipeline) writes. Thus, jq can get -EAGAIN and terminate
prematurely. The problem is tracked as https://tracker.ceph.com/issues/67505

Retrying the script might help.

What's worse is that the whole reason for adding jq to the upmap-remapped
script is another Ceph bug: it sometimes outputs invalid JSON (containing a
literal inf or nan instead of a number), and this became much more common with
Reef, as new fields were added that are commonly equal to inf or nan. This is
tracked as https://tracker.ceph.com/issues/66215 and has a fix merged in a
not-yet-released version.

Maybe you should look into alternative tools, like
https://github.com/digitalocean/pgremapper
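As a rough sketch of why fetching and parsing the JSON directly from Python sidesteps both problems (illustrative only, not the upmap-remapped code; the pg_stats, pgid, up, and acting field names below are assumed from recent Ceph releases):

    import json
    import subprocess

    # Run the ceph CLI directly and capture its output; with no jq in a
    # pipeline, there is no second process sharing a non-blocking stderr.
    out = subprocess.run(
        ["ceph", "pg", "ls", "remapped", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout

    # Python's json module accepts the non-standard Infinity/NaN tokens by
    # default, turning them into float('inf') / float('nan') instead of failing.
    data = json.loads(out)

    for pg in data.get("pg_stats", []):
        print(pg["pgid"], pg["up"], pg["acting"])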
On Fri, Jan 17, 2025 at 11:43 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
>
> > On Jan 17, 2025, at 6:02 AM, Kasper Rasmussen
> > <kasper_steenga...@hotmail.com> wrote:
> >
> > However I'm concerned with the amount of data that needs to be rebalanced,
> > since the cluster holds multiple PB, and I'm looking for review of/input
> > for my plan, as well as words of advice/experience from someone who has
> > been in a similar situation.
>
> Yep, that’s why you want to use upmap-remapped. Otherwise the thundering
> herd of data shuffling will DoS your client traffic, esp. since you’re using
> spinners. Count on pretty much all data moving in the process, and the
> convergence taking …. maybe a week?
>
> > On Pacific: Data is marked as "degraded", and not misplaced as expected. I
> > also see above 2000% degraded data (but that might be another issue)
> >
> > On Quincy: Data is marked as misplaced - which seems correct.
>
> I’m not specifically familiar with such a change, but that could be mainly
> cosmetic, a function of how the percentage is calculated for objects / PGs
> that are multiply remapped.
>
> In the depths of time I had clusters that would sometimes show a negative
> number of RADOS objects to recover, it would bounce above and below zero a
> few times as it converged to 0.
>
> > Instead balancing has been done by a cron job executing - ceph osd
> > reweight-by-utilization 112 0.05 30
>
> I used a similar strategy with older releases. Note that this will
> complicate your transition, as those relative weights are a function of the
> CRUSH topology, so when the topology changes, likely some reweighted OSDs
> will get much less than their fair share, and some will get much more. How
> full is your cluster (ceph df)? It might not be a bad idea to incrementally
> revert those all to 1.00000 if you have the capacity, and disable the cron
> job.
>
> You’ll also likely want to switch to the balancer module for the
> upmap-remapped strategy to incrementally move your data around. Did you have
> it disabled for a specific reason?
>
> Updating to Reef before migrating might be to your advantage so that you can
> benefit from performance and efficiency improvements since Pacific.

--
Alexander Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io