The latest version of upmap-remapped (since September) switched to using the Python rados bindings, which not only fixes this problem but also makes it much faster. It also has a fix I made that orders the upmaps so that data is moved off of OSDs before data is moved onto them. This helps a lot on clusters with EC pools.
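To illustrate the ordering idea, here is a toy sketch (not the actual script; the UpmapOp structure and order_ops helper are made up for this example): any op that drains an OSD which some other op will later fill gets scheduled first.

    from collections import namedtuple

    # Hypothetical structure for illustration: each planned upmap records which
    # OSDs a PG's data would leave and which OSDs it would land on.
    UpmapOp = namedtuple("UpmapOp", ["pgid", "from_osds", "to_osds"])

    def order_ops(ops):
        """Apply upmaps that drain an OSD before upmaps that fill that same OSD."""
        receiving = {osd for op in ops for osd in op.to_osds}
        # Ops whose source OSDs are destinations of other ops free up space that
        # those later ops will need, so schedule them first.
        return sorted(
            ops,
            key=lambda op: 0 if any(o in receiving for o in op.from_osds) else 1,
        )

    ops = [
        UpmapOp("1.a", from_osds=[3], to_osds=[7]),
        UpmapOp("1.b", from_osds=[7], to_osds=[9]),  # drains osd.7, so it runs first
    ]
    for op in order_ops(ops):
        pairs = " ".join(f"{src} {dst}" for src, dst in zip(op.from_osds, op.to_osds))
        # The real script talks to the cluster via the rados bindings; the
        # equivalent CLI command is printed here only to show the resulting order.
        print(f"ceph osd pg-upmap-items {op.pgid} {pairs}")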
Bryan

From: Alexander Patrakov <patra...@gmail.com>
Date: Friday, January 17, 2025 at 09:53
To: Anthony D'Atri <anthony.da...@gmail.com>
Cc: Kasper Rasmussen <kasper_steenga...@hotmail.com>, ceph-users@ceph.io <ceph-users@ceph.io>
Subject: [ceph-users] Re: Adding Rack to crushmap - Rebalancing multiple PB of data - advice/experience

Hello Kasper,

Please be aware that the current "upmap-remapped" script is flaky. It might
just refuse to work, with this message:

Error loading remapped pgs

This has been traced to the fact that "ceph pg ls remapped -f json" sets its
stderr to non-blocking mode, and that is the same file descriptor to which jq
(which follows in the pipeline) writes. Thus, jq can get -EAGAIN and terminate
prematurely. The problem is tracked as https://tracker.ceph.com/issues/67505

Retrying the script might help.

What's worse is that the whole reason for adding jq to the upmap-remapped
script is another Ceph bug: it sometimes outputs invalid JSON (containing a
literal inf or nan instead of a number), and this became much more common with
Reef, as new fields were added that are commonly equal to inf or nan. This is
tracked as https://tracker.ceph.com/issues/66215 and has a fix merged in a
not-yet-released version.

Maybe you should look into alternative tools, like
https://github.com/digitalocean/pgremapper
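As a rough sketch of why fetching and parsing the JSON directly from Python sidesteps both problems (illustrative only, not the upmap-remapped code; the pg_stats, pgid, up, and acting field names below are assumed from recent Ceph releases):

    import json
    import subprocess

    # Run the ceph CLI directly and capture its output; with no jq in a
    # pipeline, there is no second process sharing a non-blocking stderr.
    out = subprocess.run(
        ["ceph", "pg", "ls", "remapped", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout

    # Python's json module accepts the non-standard Infinity/NaN tokens by
    # default, turning them into float('inf') / float('nan') instead of failing.
    data = json.loads(out)

    for pg in data.get("pg_stats", []):
        print(pg["pgid"], pg["up"], pg["acting"])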
On Fri, Jan 17, 2025 at 11:43 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
>
> > On Jan 17, 2025, at 6:02 AM, Kasper Rasmussen
> > <kasper_steenga...@hotmail.com> wrote:
> >
> > However I'm concerned with the amount of data that needs to be rebalanced,
> > since the cluster holds multiple PB, and I'm looking for review of/input
> > for my plan, as well as words of advice/experience from someone who has
> > been in a similar situation.
>
> Yep, that’s why you want to use upmap-remapped. Otherwise the thundering
> herd of data shuffling will DoS your client traffic, esp. since you’re using
> spinners. Count on pretty much all data moving in the process, and the
> convergence taking …. maybe a week?
>
> > On Pacific: Data is marked as "degraded", and not misplaced as expected. I
> > also see above 2000% degraded data (but that might be another issue)
> >
> > On Quincy: Data is marked as misplaced - which seems correct.
>
> I’m not specifically familiar with such a change, but that could be mainly
> cosmetic, a function of how the percentage is calculated for objects / PGs
> that are multiply remapped.
>
> In the depths of time I had clusters that would sometimes show a negative
> number of RADOS objects to recover, it would bounce above and below zero a
> few times as it converged to 0.
>
> > Instead balancing has been done by a cron job executing - ceph osd
> > reweight-by-utilization 112 0.05 30
>
> I used a similar strategy with older releases. Note that this will
> complicate your transition, as those relative weights are a function of the
> CRUSH topology, so when the topology changes, likely some reweighted OSDs
> will get much less than their fair share, and some will get much more. How
> full is your cluster (ceph df)? It might not be a bad idea to incrementally
> revert those all to 1.00000 if you have the capacity, and disable the cron
> job.
>
> You’ll also likely want to switch to the balancer module for the
> upmap-remapped strategy to incrementally move your data around. Did you have
> it disabled for a specific reason?
>
> Updating to Reef before migrating might be to your advantage so that you can
> benefit from performance and efficiency improvements since Pacific.

--
Alexander Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io