-> $ ceph balancer status
{
"active": true,
"last_optimize_duration": "0:00:00.001667",
"last_optimize_started": "Mon Mar 2 04:08:35 2026",
"mode": "upmap",
"no_optimization_needed": true,
"optimize_result": "Unable to find further optimization, or
pool(s) pg_num is decreasing, or distribution is already perfect",
"plans": []
}
-> $ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 1.07428 root default
-3 0.34180 host podster1
9 ssd 0.04880 osd.9 up 1.00000 1.00000
10 ssd 0.29300 osd.10 up 1.00000 1.00000
-7 0.39069 host podster2
0 ssd 0.09769 osd.0 up 1.00000 1.00000
4 ssd 0.29300 osd.4 up 1.00000 1.00000
-5 0.34180 host podster3
1 ssd 0.04880 osd.1 up 1.00000 1.00000
5 ssd 0.29300 osd.5 up 1.00000 1.00000
-> $ ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
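For context, here is a minimal sketch (my own, not from Ceph docs verbatim) of how these three ratios classify an OSD; the thresholds come from the `ceph osd dump` output above, and the 93.29% figure is osd.0's %USE shown further down:

```python
# Minimal sketch: classify an OSD's utilization against the cluster
# ratios from `ceph osd dump` above.
FULL_RATIO = 0.95
BACKFILLFULL_RATIO = 0.90
NEARFULL_RATIO = 0.85

def classify(use: float) -> str:
    """Return the most severe threshold this utilization has crossed."""
    if use >= FULL_RATIO:
        return "full"            # writes to the OSD are blocked
    if use >= BACKFILLFULL_RATIO:
        return "backfillfull"    # backfills into this OSD are refused
    if use >= NEARFULL_RATIO:
        return "nearfull"        # warning only
    return "ok"

print(classify(0.9329))  # osd.0 at 93.29% -> "backfillfull"
```

which matches the OSD_BACKFILLFULL warning below: osd.0 is above 0.90 but still below 0.95.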
I've changed upmap_max_deviation = 1
I also did:
-> $ ceph osd reweight osd.0 1.0
and stopped all client activity to the FSes.
I've extended the disk devices Ceph uses twice already, because 'health'
complained about 'backfillfull' like below; back then only the _pg_
warning was there, it is now:
-> $ ceph health detail
HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add
storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5
pool(s) backfillfull
[WRN] OSD_BACKFILLFULL: 1 backfillfull osd(s)
osd.0 is backfill full
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if
this doesn't resolve itself): 3 pgs backfill_toofull
pg 3.1b is active+remapped+backfill_toofull, acting [4,5,10]
pg 3.21 is active+remapped+backfill_toofull, acting [4,5,10]
pg 3.23 is active+remapped+backfill_toofull, acting [10,5,4]
[WRN] POOL_BACKFILLFULL: 5 pool(s) backfillfull
pool '.mgr' is backfillfull
pool 'cephfs.APKI.meta' is backfillfull
pool 'cephfs.APKI.data' is backfillfull
pool 'cephfs.MONERO.meta' is backfillfull
pool 'cephfs.MONERO.data' is backfillfull
Each time I "extended" the disk devices, the cluster went on to use
osd.0 and fill it up.
Yes, everything is small in this cluster; it's a lab.
-> $ ceph -w
cluster:
id: 9f4f9dba-72c7-11f0-8052-525400519d29
health: HEALTH_WARN
1 backfillfull osd(s)
Low space hindering backfill (add storage if this
doesn't resolve itself): 3 pgs backfill_toofull
5 pool(s) backfillfull
services:
mon: 3 daemons, quorum podster3,podster2,podster1 (age 2d)
[leader: podster3]
mgr: podster1.qzojrl(active, since 2d), standbys: podster3.kyyolr
mds: 2/2 daemons up, 2 standby
osd: 6 osds: 6 up (since 12h), 6 in (since 2d); 3 remapped pgs
data:
volumes: 2/2 healthy
pools: 5 pools, 289 pgs
objects: 102.74k objects, 355 GiB
usage: 1.1 TiB used, 523 GiB / 1.6 TiB avail
pgs: 737/308229 objects misplaced (0.239%)
286 active+clean
3 active+remapped+backfill_toofull
2026-03-02T04:10:00.000125+0000 mon.podster3 [WRN] overall
HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add
storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5
pool(s) backfillfull
2026-03-02T04:20:00.000091+0000 mon.podster3 [WRN] overall
HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add
storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5
pool(s) backfillfull
2026-03-02T04:26:38.174003+0000 mon.podster3 [WRN] Health check
failed: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2026-03-02T04:26:44.210790+0000 mon.podster3 [INF] Health check
cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg
peering)
2026-03-02T04:30:00.000140+0000 mon.podster3 [WRN] overall
HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add
storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5
pool(s) backfillfull
2026-03-02T04:40:00.000118+0000 mon.podster3 [WRN] overall
HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add
storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5
pool(s) backfillfull
2026-03-02T04:50:00.000114+0000 mon.podster3 [WRN] overall
HEALTH_WARN 1 backfillfull osd(s); Low space hindering backfill (add
storage if this doesn't resolve itself): 3 pgs backfill_toofull; 5
pool(s) backfillfull
If the cluster is doing something good, something it should be doing to
heal, it's happening extremely slowly, given that no clients are working
now and, as you said, the storage capacities are minute compared to
anything production.
7 hours later, still:
-> $ ceph osd df tree | egrep '(osd.0|ID)'
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META
AVAIL %USE VAR PGS STATUS TYPE NAME
0 ssd 0.09769 1.00000 150 GiB 140 GiB 88 GiB 1.0 MiB 2.0
GiB 10 GiB 93.29 1.37 72 up osd.0
Now, 5 hours after I started this draft, I ran:
-> $ ceph osd reweight-by-utilization
and before that command I had also noticed:
-> $ ceph config get mgr mgr/balancer/begin_weekday
-> $ ceph config get mgr mgr/balancer/end_weekday
which were set by the 'deployment' process (cephadm bootstrap) and made
me wonder:
does that mean that auto-rebalance runs only on Sunday?
I changed: end_weekday = 6
_reweight-by-utilization_, I notice, changed REWEIGHT for osd.0, and
that did something, I think.
So now _active+remapped+backfill_toofull_ is gone from the 'pgs' part
of the health report.
RAW USE & DATA are down, but still:
-> $ ceph osd df tree | egrep '(osd.0|ID)'
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META
AVAIL %USE VAR PGS STATUS TYPE NAME
0 ssd 0.09769 0.90002 150 GiB 134 GiB 82 GiB 1.1 MiB 2.0
GiB 16 GiB 89.51 1.31 68 up osd.0
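As I understand it (my assumption, not something the docs spell out in this thread), that REWEIGHT column is a 0..1 override that multiplies the CRUSH WEIGHT, so osd.0's effective placement weight after reweight-by-utilization would be roughly:

```python
# Sketch: REWEIGHT from `ceph osd reweight-by-utilization` acts as an
# override multiplier on the WEIGHT column, so for osd.0 above:
crush_weight = 0.09769   # WEIGHT column for osd.0
override = 0.90002       # REWEIGHT column after reweight-by-utilization
effective = crush_weight * override
print(round(effective, 5))  # ~0.08792, i.e. osd.0 now attracts ~10% fewer PGs
```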
and when compared to the other hosts' OSDs which use "identical" disk drives:
-> $ ceph osd df tree | egrep '(osd\.[0,1,9] |ID)'
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META
AVAIL %USE VAR PGS STATUS TYPE NAME
9 ssd 0.04880 1.00000 150 GiB 52 GiB 51 GiB 624 KiB 1.4
GiB 98 GiB 34.83 0.51 43 up osd.9
0 ssd 0.09769 0.90002 150 GiB 134 GiB 82 GiB 1.1 MiB 2.0
GiB 16 GiB 89.51 1.31 68 up osd.0
1 ssd 0.04880 1.00000 150 GiB 53 GiB 53 GiB 526 KiB 254
MiB 97 GiB 35.40 0.52 44 up osd.1
Perhaps the cluster goes only as far as making _backfill_toofull_ go
away and then "gives up"?
The "other" disk drives:
-> $ ceph osd df tree | egrep '(osd\.(5|4|10)\ |ID)'
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META
AVAIL %USE VAR PGS STATUS TYPE NAME
10 ssd 0.29300 1.00000 400 GiB 307 GiB 305 GiB 2.0 MiB 2.1
GiB 93 GiB 76.68 1.12 246 up osd.10
4 ssd 0.29300 1.00000 400 GiB 275 GiB 273 GiB 3.2 MiB 2.1
GiB 125 GiB 68.83 1.01 221 up osd.4
5 ssd 0.29300 1.00000 400 GiB 305 GiB 303 GiB 3.0 MiB 2.5
GiB 95 GiB 76.27 1.12 245 up osd.5
It seems that _host podster2_ balances its OSDs 4 & 0 "differently"
from what the other two hosts do; if so, then why?
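One thing the `ceph osd df tree` outputs above do let me check (my own reasoning, so take it with a grain of salt): CRUSH places data in proportion to the WEIGHT column, and osd.0's weight (0.09769, i.e. ~100 GiB) is double that of osd.1 and osd.9 (0.04880, ~50 GiB), even though all three now sit on 150 GiB devices. Comparing each host's smaller OSD against its host total:

```python
# Sketch: share of each host's CRUSH weight held by its smaller OSD,
# using the WEIGHT column from `ceph osd df tree` above. CRUSH places
# data proportionally to weight, so a larger share means more PGs.
hosts = {
    "podster1 (osd.9)": (0.04880, 0.29300),
    "podster2 (osd.0)": (0.09769, 0.29300),
    "podster3 (osd.1)": (0.04880, 0.29300),
}
for name, (small, big) in hosts.items():
    share = small / (small + big)
    print(f"{name}: {share:.1%} of the host's weight")
```

podster2's small OSD carries ~25% of its host's weight versus ~14% on the other two hosts, which would be consistent with osd.0 filling up first whenever data moves.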
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]