A status update: the rebalance has finally finished, after stopping and 
restarting it three times.
It would be interesting to know what extra data is generated on the new OSDs 
while remapped PGs are allocated during the rebalance. I stopped when the OSDs 
reached 89% full and let the cluster rest for two days; usage went down to 64%. 
After restarting they filled up to 89% again, so I stopped for another day and 
usage went down to 71%; the final restart then finished.

I have some concerns regarding the PG allocator algorithm in 
Octopus—specifically, there may be issues either within the algorithm itself or 
potentially due to anomalous data generated during rebalancing. I'm not certain 
if this has been addressed or optimized in more recent releases.
Currently, our maximum PG size is 60GB. However, during rebalancing, it appears 
that PGs are being allocated without a thorough check on their sizes. Upon 
reviewing the number of PGs on our OSDs, I’m seeing values between 60 and 70. 
This implies that if 60–70 PGs, each sized at 60GB, are assigned to a single 
OSD, the OSD will inevitably reach full capacity. A more balanced approach 
would be to distribute smaller PGs as well, not just the largest ones. It is 
possible that the balancer intervenes at a later stage but not while PGs are 
still remapped; this is not entirely clear from the current behavior.
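
For reference, the PG counts and sizes per OSD can be checked with something 
like the commands below (the osd id is just a placeholder and the exact output 
columns vary a bit between releases, so treat this as a sketch rather than an 
exact procedure):

ceph osd df tree            # PGS column: PG count per OSD, plus fill level
ceph pg ls-by-osd osd.123   # BYTES column: size of each PG mapped to that OSD
ceph pg ls remapped         # only the PGs still remapped, if the state filter is available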


________________________________
From: Anthony D'Atri <anthony.da...@gmail.com>
Sent: Sunday, May 25, 2025 10:39 AM
To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
Cc: Anthony D'Atri <a...@dreamsnake.net>; Ceph Users <ceph-users@ceph.io>
Subject: Re: [ceph-users] Newly added node osds getting full before rebalance 
finished

________________________________
I might suggest, after verifying what I wrote before:

Run upmap-remapped 2-3 times until the misplaced objects / PGs mostly 
evaporate.  Set the max deviation and misplaced ratio to 1, then turn on the 
balancer.

You might need to raise the backfillfull and full ratios a couple percent 
temporarily until the logjam subsides.
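
Roughly, something like the following (a sketch assuming the option names in 
recent releases and the external upmap-remapped.py helper script; adjust the 
values and names for your environment):

upmap-remapped.py | sh      # repeat 2-3 times until few remapped PGs remain

ceph config set mgr mgr/balancer/upmap_max_deviation 1
ceph config set mgr target_max_misplaced_ratio 1   # 1 = 100% may be misplaced
ceph balancer mode upmap
ceph balancer on

# temporarily raise the ratios a couple of percent (defaults are 0.90 / 0.95),
# and remember to lower them again afterwards
ceph osd set-backfillfull-ratio 0.92
ceph osd set-full-ratio 0.97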

On May 24, 2025, at 10:40 PM, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> 
wrote:


It has been stopped for 1.5 days already; usage on that node's OSDs went down 
from 88% to 62% or less. However, the misplaced objects went up from 5% to 
almost 7%, most probably due to newly created data.

Here are some details; the newly added node is serverosd-2015:
https://gist.githubusercontent.com/Badb0yBadb0y/8addf9f3df0406abc3397dd9e7f1aeca/raw/8dc686576f9dcea12542996fd6bacbac1050a89c/gistfile1.txt

________________________________
From: Anthony D'Atri <anthony.da...@gmail.com>
Sent: Saturday, May 24, 2025 5:44 AM
To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
Cc: Anthony D'Atri <a...@dreamsnake.net>; Ceph Users <ceph-users@ceph.io>
Subject: Re: [ceph-users] Newly added node osds getting full before rebalance 
finished

________________________________

Please do send links to the CRUSH dump, ceph osd tree, and ceph osd df output.
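
Those can be captured with, for example (the output filenames are arbitrary):

ceph osd crush dump > crush-dump.json
ceph osd tree > osd-tree.txt
ceph osd df tree > osd-df-tree.txt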

> On May 23, 2025, at 10:53 AM, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> 
> wrote:
>
> Octopus, yeah. It seems like it wants to push bigger PGs or more data to the 
> new node. I wonder whether, if I added more disks to this node, the weight 
> (not the CRUSH weight) of the node would temporarily become "heavier"... I'm 
> not sure what the best way forward is; for now I have stopped the rebalance.
>
> Get Outlook for Android<https://aka.ms/AAb9ysg>
> ________________________________
> From: Anthony D'Atri <a...@dreamsnake.net>
> Sent: Friday, May 23, 2025 8:55:31 PM
> To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
> Cc: Ceph Users <ceph-users@ceph.io>
> Subject: Re: [ceph-users] Newly added node osds getting full before rebalance 
> finished
>
> ________________________________
> Octopus?  Those bugs were fixed by Hammer IIRC.
>
> On May 23, 2025, at 9:45 AM, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> 
> wrote:
>
> sorry, typo, 15.2.17
>
> Get Outlook for Android<https://aka.ms/AAb9ysg>
> ________________________________
> From: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
> Sent: Friday, May 23, 2025 8:43:44 PM
> To: Anthony D'Atri <a...@dreamsnake.net>
> Cc: Ceph Users <ceph-users@ceph.io>
> Subject: Re: [ceph-users] Newly added node osds getting full before rebalance 
> finished
>
> 17.2.8
>
> Get Outlook for Android<https://aka.ms/AAb9ysg>
> ________________________________
> From: Anthony D'Atri <a...@dreamsnake.net>
> Sent: Friday, May 23, 2025 8:05:54 PM
> To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
> Cc: Ceph Users <ceph-users@ceph.io>
> Subject: Re: [ceph-users] Newly added node osds getting full before rebalance 
> finished
>
> ________________________________
>
> For due diligence I have to ask:  are you running a recent Ceph release, not 
> like Firefly?  There were certain bugs back then...
>
>> On May 23, 2025, at 4:07 AM, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> 
>> wrote:
>>
>> Hi,
>>
>> For some reason this is the 2nd time I have run into the same issue: I added 
>> a new node when the most full OSD in the cluster was at 73%, and now all the 
>> newly added OSDs are between 70-80% full, only halfway through the rebalance.
>
> Did they or their hosts somehow get the wrong CRUSH weight?
>
> `ceph osd df tree` and `ceph osd tree` would give us some insights.
>
>> I don't understand why. Could this be because of the mgr balancer's upmap 
>> mode with deviation 1?
>
> No, that’s reasonable especially if you have OSDs of multiple sizes.
>
>> Another question: I have 7 disks in each server; however, in the newly added 
>> one I still have 4 spares which are not yet added. Just to get out of this 
>> space issue, would it cause any problem if I added the remaining 4 spare disks
>
> I wouldn’t think so.
>
>> (which would make the server sizes in the cluster unbalanced: the weight of 
>> each server is 100TB, and this would make only this node 160TB)?
>
> Depending on your replication/EC profiles, CRUSH rules, and number of failure 
> domains, that extra capacity may or may not increase the cluster’s capacity, 
> but it should certainly decrease the average fillage of the OSDs on that host, 
> unless you’re adding them in such a way that skews the host’s CRUSH weight.
>
> Sometimes, with a thundering herd of backfill, OSDs can temporarily increase 
> in fillage.  When you add OSDs, it’s not entirely the case that only data 
> “moving” to the new OSDs will be remapped; some existing data may be 
> reshuffled as well, though much less than with old releases.  Also, when 
> backfilling, Ceph will write new replicas before removing displaced replicas, 
> in order to ensure safety.  Usually this doesn’t result in quite the 
> percentages you report, but upmap-remapped can help manage this dynamic.
>
>>
>> Thank you
>>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
