osd.0 is 150 GB in size, but its CRUSH weight is only 0.09769 (~100 GiB). And you didn't provide the exact commands you used to extend the OSDs.
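Keep in mind that CRUSH weights are denominated in TiB, so a 150 GB device would normally carry a weight of roughly 150/1024 ≈ 0.1465. If you only grew the block device but never updated the CRUSH weight, you can set it by hand; a sketch (the value is illustrative, adjust it to the real size):

$ # CRUSH weight is in TiB: 150 GiB / 1024 ≈ 0.1465
$ ceph osd crush reweight osd.0 0.1465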

Quoting listy via ceph-users <[email protected]>:

-> $ ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.001667",
    "last_optimize_started": "Mon Mar  2 04:08:35 2026",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}
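For completeness, the balancer can also be asked for a plan by hand ('myplan' below is just an arbitrary name):

$ ceph balancer eval              # score of the current distribution
$ ceph balancer optimize myplan   # ask it to build a concrete plan
$ ceph balancer show myplan       # inspect the plan, if one was produced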

-> $ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         1.07428  root default
-3         0.34180      host podster1
 9    ssd  0.04880          osd.9          up   1.00000  1.00000
10    ssd  0.29300          osd.10         up   1.00000  1.00000
-7         0.39069      host podster2
 0    ssd  0.09769          osd.0          up   1.00000  1.00000
 4    ssd  0.29300          osd.4          up   1.00000  1.00000
-5         0.34180      host podster3
 1    ssd  0.04880          osd.1          up   1.00000  1.00000
 5    ssd  0.29300          osd.5          up   1.00000  1.00000

-> $ ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
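Since this is a lab, one stopgap for backfill_toofull is to raise the backfillfull ratio slightly and lower it back once backfill completes; an illustrative example, not a permanent setting:

$ ceph osd set-backfillfull-ratio 0.92   # default is 0.90; revert afterwards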

I've changed upmap_max_deviation = 1.
I also did:
-> $ ceph osd reweight osd.0 1.0
and stopped all client activity on the filesystems.
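As I understand it, _ceph osd reweight_ only changes the 0..1 override shown in the REWEIGHT column, while the capacity weight CRUSH uses is kept in the CRUSH map and set separately. The two knobs side by side (the TiB value is illustrative):

$ ceph osd reweight osd.0 1.0            # temporary override, range 0..1
$ ceph osd crush reweight osd.0 0.1465   # capacity weight, in TiB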

I've extended the disk devices Ceph uses twice, because 'health' kept complaining about 'backfillfull' like below (only the _pg_ numbers were different then; the current state is below):

-> $ ceph health detail
HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5 pool(s) backfillfull
[WRN] OSD_BACKFILLFULL: 1 backfillfull osd(s)
    osd.0 is backfill full
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull
    pg 3.1b is active+remapped+backfill_toofull, acting [4,5,10]
    pg 3.21 is active+remapped+backfill_toofull, acting [4,5,10]
    pg 3.23 is active+remapped+backfill_toofull, acting [10,5,4]
[WRN] POOL_BACKFILLFULL: 5 pool(s) backfillfull
    pool '.mgr' is backfillfull
    pool 'cephfs.APKI.meta' is backfillfull
    pool 'cephfs.APKI.data' is backfillfull
    pool 'cephfs.MONERO.meta' is backfillfull
    pool 'cephfs.MONERO.data' is backfillfull
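To see where the stuck PGs want to go (and hence which OSD is too full to accept them), they can be mapped individually, e.g.:

$ ceph pg map 3.1b   # prints the up and acting OSD sets for this PG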

Each time I "extended" the disk devices, the cluster went on to use osd.0 and filled it up.
Yes everything is small in this cluster, it's a lab.
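For reference, the usual sequence for growing a BlueStore OSD under cephadm looks roughly like the following; the OSD id and path are illustrative, and I'm not claiming the details were identical here:

$ # stop the daemon, let BlueFS see the grown LV/partition, restart
$ ceph orch daemon stop osd.0
$ cephadm shell --name osd.0 -- ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0
$ ceph orch daemon start osd.0
$ # then update the CRUSH weight to the new size in TiB (value illustrative)
$ ceph osd crush reweight osd.0 0.1465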

-> $ ceph -w
  cluster:
    id:     9f4f9dba-72c7-11f0-8052-525400519d29
    health: HEALTH_WARN
            1 backfillfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull
            5 pool(s) backfillfull

  services:
    mon: 3 daemons, quorum podster3,podster2,podster1 (age 2d) [leader: podster3]
    mgr: podster1.qzojrl(active, since 2d), standbys: podster3.kyyolr
    mds: 2/2 daemons up, 2 standby
    osd: 6 osds: 6 up (since 12h), 6 in (since 2d); 3 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   5 pools, 289 pgs
    objects: 102.74k objects, 355 GiB
    usage:   1.1 TiB used, 523 GiB / 1.6 TiB avail
    pgs:     737/308229 objects misplaced (0.239%)
             286 active+clean
             3   active+remapped+backfill_toofull


2026-03-02T04:10:00.000125+0000 mon.podster3 [WRN] overall HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5 pool(s) backfillfull
2026-03-02T04:20:00.000091+0000 mon.podster3 [WRN] overall HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5 pool(s) backfillfull
2026-03-02T04:26:38.174003+0000 mon.podster3 [WRN] Health check failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2026-03-02T04:26:44.210790+0000 mon.podster3 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2026-03-02T04:30:00.000140+0000 mon.podster3 [WRN] overall HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5 pool(s) backfillfull
2026-03-02T04:40:00.000118+0000 mon.podster3 [WRN] overall HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5 pool(s) backfillfull
2026-03-02T04:50:00.000114+0000 mon.podster3 [WRN] overall HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5 pool(s) backfillfull

If the cluster is doing something useful to heal itself, it is happening extremely slowly, given that there is no client activity now and, as you said, the storage capacities are tiny compared to anything production.
7 hours later, still:
-> $ ceph osd df tree | egrep '(osd.0|ID)'
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA    OMAP     META     AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 0    ssd  0.09769   1.00000  150 GiB  140 GiB  88 GiB  1.0 MiB  2.0 GiB  10 GiB  93.29  1.37   72      up  osd.0

Now, 5 hours later (since I started this draft), I ran:
-> $ ceph osd reweight-by-utilization
Before running that command, I had also noticed:
-> $ ceph config get mgr mgr/balancer/begin_weekday

-> $ ceph config get mgr mgr/balancer/end_weekday

which were set by the deployment process (cephadm bootstrap), and that made me wonder:
does that mean that auto-rebalancing runs only on Sunday?
I changed: end_weekday = 6
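The weekday window can be inspected and widened like this (per the option descriptions, 0 or 7 = Sunday, 1 = Monday, and so on; whether the bootstrap defaults actually restricted anything, I'm not sure):

$ ceph config get mgr mgr/balancer/begin_weekday
$ ceph config set mgr mgr/balancer/end_weekday 6   # what I set; 7 would include Sunday too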

_reweight-by-utilization_, I notice, changed the REWEIGHT for osd.0, and that did something, I think: _active+remapped+backfill_toofull_ is now gone from the 'pgs' part of the health report.
RAW USE & DATA are down, but still:
-> $ ceph osd df tree | egrep '(osd.0|ID)'
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA    OMAP     META     AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 0    ssd  0.09769   0.90002  150 GiB  134 GiB  82 GiB  1.1 MiB  2.0 GiB  16 GiB  89.51  1.31   68      up  osd.0
and compared to the other hosts' OSDs, which use "identical" disk drives:
-> $ ceph osd df tree | egrep '(osd\.[0,1,9] |ID)'
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA    OMAP     META     AVAIL   %USE   VAR   PGS  STATUS  TYPE NAME
 9    ssd  0.04880   1.00000  150 GiB   52 GiB  51 GiB  624 KiB  1.4 GiB  98 GiB  34.83  0.51   43      up  osd.9
 0    ssd  0.09769   0.90002  150 GiB  134 GiB  82 GiB  1.1 MiB  2.0 GiB  16 GiB  89.51  1.31   68      up  osd.0
 1    ssd  0.04880   1.00000  150 GiB   53 GiB  53 GiB  526 KiB  254 MiB  97 GiB  35.40  0.52   44      up  osd.1

Perhaps the cluster only goes as far as making _backfill_toofull_ disappear and then "gives up"?
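One way to check whether more reweighting would even be proposed, without applying anything:

$ ceph osd test-reweight-by-utilization   # dry run; only prints the REWEIGHT changes it would make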
The "other" disk-drives:
-> $ ceph osd df tree | egrep '(osd\.(5|4|10)\  |ID)'
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
10    ssd  0.29300   1.00000  400 GiB  307 GiB  305 GiB  2.0 MiB  2.1 GiB   93 GiB  76.68  1.12  246      up  osd.10
 4    ssd  0.29300   1.00000  400 GiB  275 GiB  273 GiB  3.2 MiB  2.1 GiB  125 GiB  68.83  1.01  221      up  osd.4
 5    ssd  0.29300   1.00000  400 GiB  305 GiB  303 GiB  3.0 MiB  2.5 GiB   95 GiB  76.27  1.12  245      up  osd.5

It seems that _host podster2_ balances its osds 4 & 0 "differently" from what the other two hosts do; if so, why?
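A way to check whether explicit upmap exceptions are pinning PGs (which could make hosts look like they balance "differently"):

$ ceph osd dump | grep upmap   # lists pg_upmap_items entries, if any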
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]

