That’s great to know, Bryan.  I’ve seen multiple locations for the code out 
there; which one is canonical? (Lowercase c)

> On Jan 17, 2025, at 3:46 PM, Stillwell, Bryan <bstil...@akamai.com> wrote:
> 
> The latest version (since September) switched to using the python rados 
> bindings, which not only fixes this problem but also makes it much faster. 
> It also has a fix I made that orders the upmaps so that data is moved off 
> OSDs before data is moved onto them. This helps a lot on clusters with EC 
> pools.
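> 
> If you're curious what the bindings-based approach looks like, here is a 
> rough, untested sketch of reading upmap state straight from the cluster with 
> the python rados bindings instead of parsing CLI output in a shell pipeline. 
> The conffile path and the use of "osd dump" are my own illustration, not a 
> copy of the actual script:
> 
> import json
> import rados
> 
> # Connect using the local ceph.conf; adjust conffile/keyring as needed.
> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
> cluster.connect()
> try:
>     # Ask the monitors for the OSD map as JSON -- no shell pipeline, no jq.
>     ret, outbuf, errs = cluster.mon_command(
>         json.dumps({'prefix': 'osd dump', 'format': 'json'}), b'')
>     if ret != 0:
>         raise RuntimeError(errs)
>     osdmap = json.loads(outbuf)
>     # Existing upmap exceptions live under pg_upmap_items.
>     for item in osdmap.get('pg_upmap_items', []):
>         print(item['pgid'], item['mappings'])
> finally:
>     cluster.shutdown()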
>  
> Bryan
>  
> From: Alexander Patrakov <patra...@gmail.com>
> Date: Friday, January 17, 2025 at 09:53
> To: Anthony D'Atri <anthony.da...@gmail.com>
> Cc: Kasper Rasmussen <kasper_steenga...@hotmail.com>, ceph-users@ceph.io
> Subject: [ceph-users] Re: Adding Rack to crushmap - Rebalancing multiple PB 
> of data - advice/experience
> 
> Hello Kasper,
> 
> Please be aware that the current "upmap-remapped" script is flaky. It
> might just refuse to work, with this message:
> 
> Error loading remapped pgs
> 
> This has been traced to the fact that "ceph pg ls remapped -f json"
> sets its stderr to non-blocking mode, and jq (next in the pipeline)
> writes to that same shared file descriptor. As a result, jq's writes
> can fail with EAGAIN and it terminates prematurely.
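> 
> For the curious, the mechanism is easy to reproduce: status flags such as
> O_NONBLOCK set with fcntl() live on the open file description, which both
> commands in a pipeline share when they inherit the same stderr. A tiny
> illustration (not taken from the script):
> 
> import fcntl
> import os
> import sys
> 
> # Both halves of "ceph ... | jq ..." inherit the shell's stderr, so they
> # share one open file description for fd 2.
> fd = sys.stderr.fileno()
> flags = fcntl.fcntl(fd, fcntl.F_GETFL)
> fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
> # From now on, a large write() to this descriptor -- from either process --
> # can fail with EAGAIN instead of blocking until the terminal drains.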
> 
> The problem is tracked as https://tracker.ceph.com/issues/67505
> 
> Retrying the script might help.
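> 
> If you want to script around the flakiness, a crude retry loop is enough.
> The command line below is only a placeholder for however you invoke
> upmap-remapped, and the error-message check is an assumption:
> 
> import subprocess
> import time
> 
> CMD = ['./upmap-remapped.py']  # placeholder; substitute your invocation
> 
> for _ in range(5):
>     result = subprocess.run(CMD, capture_output=True, text=True)
>     output = result.stdout + result.stderr
>     if result.returncode == 0 and 'Error loading remapped pgs' not in output:
>         print(result.stdout, end='')  # success: pass the output along
>         break
>     time.sleep(2)  # brief pause before retrying the flaky run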
> 
> What's worse is that the whole reason for adding jq to the
> upmap-remapped script is another Ceph bug: Ceph sometimes emits
> invalid JSON (a literal inf or nan where a number should be), and this
> became much more common with Reef, as new fields were added that are
> often equal to inf or nan. This is tracked as
> https://tracker.ceph.com/issues/66215 and has a fix merged in a
> not-yet-released version.
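> 
> Until that fix ships, one crude workaround is to patch the bare inf/nan
> tokens before handing the text to a strict JSON parser. A rough sketch, not
> what upmap-remapped actually does, and it can mangle string values that
> happen to contain such tokens:
> 
> import json
> import re
> 
> def load_ceph_json(text):
>     # Replace unquoted inf/nan (optionally negative) values with null.
>     cleaned = re.sub(r'(?<=[\s:,\[])-?(inf|nan)(?=[\s,\]\}])', 'null', text)
>     return json.loads(cleaned)
> 
> # json.loads() alone would reject this input.
> print(load_ceph_json('{"score": inf, "pg_num": 32}'))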
> 
> Maybe you should look into alternative tools, like
> https://github.com/digitalocean/pgremapper
> 
> 
> On Fri, Jan 17, 2025 at 11:43 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
> >
> >
> >
> > > On Jan 17, 2025, at 6:02 AM, Kasper Rasmussen 
> > > <kasper_steenga...@hotmail.com> wrote:
> > >
> > > However, I'm concerned about the amount of data that needs to be 
> > > rebalanced, since the cluster holds multiple PB, and I'm looking for 
> > > review of/input on my plan, as well as words of advice/experience from 
> > > someone who has been in a similar situation.
> >
> > Yep, that’s why you want to use upmap-remapped.  Otherwise the thundering 
> > herd of data shuffling will DoS your client traffic, esp. since you’re 
> > using spinners.  Count on pretty much all data moving in the process, and 
> > the convergence taking …. maybe a week?
> >
> > > On Pacific: Data is marked as "degraded", and not misplaced as expected. 
> > > I also see above 2000% degraded data (but that might be another issue)
> > >
> > > On Quincy: Data is marked as misplaced - which seems correct.
> >
> >
> > I’m not specifically familiar with such a change, but that could be mainly 
> > cosmetic, a function of how the percentage is calculated for objects / PGs 
> > that are multiply remapped.
> >
> > In the depths of time I had clusters that would sometimes show a negative 
> > number of RADOS objects to recover; it would bounce above and below zero a 
> > few times as it converged to 0.
> >
> >
> > > Instead balancing has been done by a cron job executing - ceph osd 
> > > reweight-by-utilization 112 0.05 30
> >
> > I used a similar strategy with older releases.  Note that this will 
> > complicate your transition, as those relative weights are a function of the 
> > CRUSH topology, so when the topology changes, likely some reweighted OSDs 
> > will get much less than their fair share, and some will get much more.  How 
> > full is your cluster (ceph df)?  It might not be a bad idea to 
> > incrementally revert those all to 1.00000 if you have the capacity, and 
> > disable the cron job.
> > You’ll also likely want to switch to the balancer module for the 
> > upmap-remapped strategy to incrementally move your data around.  Did you 
> > have it disabled for a specific reason?
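> > 
> > If it helps, here is a rough sketch of one way to do that incremental
> > revert with a small script. The "reweight" field name from
> > "ceph osd df -f json" and the 0.05 step are assumptions to adapt, not
> > something I have tested on your release:
> > 
> > import json
> > import subprocess
> > 
> > STEP = 0.05  # assumed nudge per pass; tune to your comfort level
> > 
> > osd_df = json.loads(subprocess.run(
> >     ['ceph', 'osd', 'df', '-f', 'json'],
> >     capture_output=True, text=True, check=True).stdout)
> > 
> > for node in osd_df['nodes']:
> >     rw = node['reweight']
> >     if rw < 1.0:
> >         new_rw = min(1.0, rw + STEP)
> >         # "ceph osd reweight <id> <weight>" moves the override back up.
> >         subprocess.run(['ceph', 'osd', 'reweight', str(node['id']),
> >                         f'{new_rw:.5f}'], check=True)
> > 
> > # Run a pass, let backfill settle, repeat until every override reads 1.0,
> > # then hand ongoing balancing to the balancer module:
> > #   ceph balancer mode upmap
> > #   ceph balancer on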
> >
> > Updating to Reef before migrating might be to your advantage so that you 
> > can benefit from performance and efficiency improvements since Pacific.
> >
> >
> 
> 
> 
> --
> Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
