They provided me with the following:

root@cmt6770:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME         STATUS  REWEIGHT  PRI-AFF
 -1         34.05125  root default
 -3          5.38539      host cmt5923
  3    ssd   1.74660          osd.3         up   0.79999  1.00000
  8    ssd   1.81940          osd.8       down   1.00000  1.00000
  9    ssd   1.81940          osd.9       down   1.00000  1.00000
-15          4.40289      host cmt6461
 24   nvme   0.90970          osd.24      down   0.79999  1.00000
  2    ssd   1.74660          osd.2       down   1.00000  1.00000
 17    ssd   1.74660          osd.17      down   1.00000  1.00000
 -5          5.35616      host cmt6770
  4   nvme   1.86299          osd.4       down   1.00000  1.00000
  0    ssd   0.87329          osd.0       down   1.00000  1.00000
  1    ssd   0.87329          osd.1       down   1.00000  1.00000
 14    ssd   0.87329          osd.14      down   1.00000  1.00000
 15    ssd   0.87329          osd.15        up   1.00000  1.00000
 -9          7.24838      host cmt7773
  5   nvme   1.81940          osd.5       down   1.00000  1.00000
 19   nvme   1.81940          osd.19      down   0.95001  1.00000
  7    ssd   1.74660          osd.7       down   1.00000  1.00000
 29    ssd   1.86299          osd.29        up   1.00000  1.00000
-13          7.93245      host dc2943
 22   nvme   0.90970          osd.22        up   1.00000  1.00000
 23   nvme   0.90970          osd.23        up   1.00000  1.00000
  6    ssd   1.74660          osd.6       down   1.00000  1.00000
 10    ssd   0.87329          osd.10        up   1.00000  1.00000
 11    ssd   0.87329          osd.11        up   1.00000  1.00000
 12    ssd   0.87329          osd.12        up   1.00000  1.00000
 13    ssd   0.87329          osd.13        up   1.00000  1.00000
 16    ssd   0.87329          osd.16        up   0.79999  1.00000
-11          3.72598      host dc3658
 20   nvme   1.86299          osd.20      down   0.95001  1.00000
 21   nvme   1.86299          osd.21        up   0.90002  1.00000
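A quick way to summarize a pasted `ceph osd tree` like the one above is to tally the down OSDs under each host bucket. This is a minimal sketch (the function name is my own, and the field positions assume the default plain-text output format):

```python
from collections import Counter

def down_osds_per_host(tree_text: str) -> Counter:
    """Tally OSDs reported 'down' under each host bucket
    in plain `ceph osd tree` output."""
    counts = Counter()
    host = None
    for line in tree_text.splitlines():
        fields = line.split()
        if "host" in fields:                 # bucket line: "... host cmt5923"
            host = fields[fields.index("host") + 1]
        elif host and "down" in fields:      # OSD line with STATUS == down
            counts[host] += 1
    return counts

sample = """\
 -3          5.38539      host cmt5923
  3    ssd   1.74660          osd.3         up   0.79999  1.00000
  8    ssd   1.81940          osd.8       down   1.00000  1.00000
  9    ssd   1.81940          osd.9       down   1.00000  1.00000
-15          4.40289      host cmt6461
 24   nvme   0.90970          osd.24      down   0.79999  1.00000
  2    ssd   1.74660          osd.2       down   1.00000  1.00000
 17    ssd   1.74660          osd.17      down   1.00000  1.00000"""

print(down_osds_per_host(sample))  # cmt5923: 2 down, cmt6461: 3 down
```

Run against the full tree, this reproduces the `1 host (3 osds) down` warning for cmt6461, where every OSD is down.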

From: Michel Raabe <ra...@b1-systems.de>
Sent: Saturday, May 10, 2025 3:57 PM
To: Senol COLAK <se...@kubedo.com>
Cc: ceph-users@ceph.io <ceph-users@ceph.io>
Subject: Re: [ceph-users] We lost the stability of the cluster, 18.2.2 -> 18.2.6 -> 19.2.1 Chain of upgrade failure
 
Hi,

Can you provide a "ceph osd df tree" output?

Regards
Michel

Sent from my mobile phone

> On 10. May 2025, at 14:56, Senol COLAK <se...@kubedo.com> wrote:
>
> Hello,
>
> After upgrading from Ceph Reef 18.2.6 to Ceph Squid 19.2.1, I restarted the OSDs and they remained down. The logs contain the following records:
>
> root@cmt6770:~# ceph health detail
> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; mon cmt6770 is low on available space; 9 osds down; 1 host (3 osds) down; 5 nearfull osd(s); Reduced data availability: 866 pgs inactive, 489 pgs down, 5 pgs incomplete, 60 pgs stale; Low space hindering backfill (add storage if this doesn't resolve itself): 9 pgs backfill_toofull; Degraded data redundancy: 432856/3408880 objects degraded (12.698%), 191 pgs degraded, 181 pgs undersized; 12 pool(s) nearfull; 255 slow ops, oldest one blocked for 2417 sec, daemons [osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.
> [WRN] FS_DEGRADED: 1 filesystem is degraded
> fs cephfs is degraded
> [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
> mds.cmt5923(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 38201 secs
> [WRN] MON_DISK_LOW: mon cmt6770 is low on available space
> mon.cmt6770 has 28% avail
> [WRN] OSD_DOWN: 9 osds down
> osd.0 (root=default,host=cmt6770) is down
> osd.1 (root=default,host=cmt6770) is down
> osd.2 (root=default,host=cmt6461) is down
> osd.7 (root=default,host=cmt7773) is down
> osd.8 (root=default,host=cmt5923) is down
> osd.9 (root=default,host=cmt5923) is down
> osd.14 (root=default,host=cmt6770) is down
> osd.17 (root=default,host=cmt6461) is down
> osd.24 (root=default,host=cmt6461) is down
> [WRN] OSD_HOST_DOWN: 1 host (3 osds) down
> host cmt6461 (root=default) (3 osds) is down
> [WRN] OSD_NEARFULL: 5 nearfull osd(s)
> osd.3 is near full
> osd.12 is near full
> osd.16 is near full
> osd.21 is near full
> osd.23 is near full
> [WRN] PG_AVAILABILITY: Reduced data availability: 866 pgs inactive, 489 pgs down, 5 pgs incomplete, 60 pgs stale
> pg 7.1c5 is down, acting [3]
> pg 7.1c7 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1c8 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1cb is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1cd is down, acting [15,3]
> pg 7.1ce is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1cf is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1d0 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1d1 is down, acting [29,13]
> pg 7.1d2 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1d3 is down, acting [23]
> pg 7.1d4 is down, acting [16]
> pg 7.1d5 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1d6 is down, acting [3]
> pg 7.1d9 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1da is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1e0 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1e1 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1e2 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1e4 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1e5 is down, acting [12,29]
> pg 7.1e7 is stuck stale for 7m, current state stale+down, last acting [9]
> pg 7.1e8 is down, acting [12]
> pg 7.1e9 is stuck stale for 31m, current state stale, last acting [2]
> pg 7.1eb is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1ed is down, acting [3]
> pg 7.1ee is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1ef is down, acting [12]
> pg 7.1f0 is down, acting [10]
> pg 7.1f1 is down, acting [12,29]
> pg 7.1f2 is down, acting [16]
> pg 7.1f3 is stuck stale for 7m, current state stale, last acting [9]
> pg 7.1f4 is down, acting [22]
> pg 7.1f5 is down, acting [22]
> pg 7.1f8 is down, acting [29]
> pg 7.1f9 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1fb is stuck inactive for 3h, current state unknown, last acting []
> pg 7.1fc is stuck stale for 15m, current state stale+down, last acting [29]
> pg 7.1fd is down, acting [3]
> pg 7.1fe is down, acting [3,15]
> pg 7.1ff is down, acting [3]
> pg 7.201 is down, acting [12]
> pg 7.204 is down, acting [10]
> pg 7.205 is down, acting [13]
> pg 7.207 is down, acting [11]
> pg 7.20a is down, acting [3]
> pg 7.20b is down, acting [22]
> pg 7.20d is stuck inactive for 3h, current state unknown, last acting []
> pg 7.210 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.211 is stuck inactive for 3h, current state unknown, last acting []
> pg 7.21b is down, acting [16]
> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 9 pgs backfill_toofull
> pg 7.20 is active+undersized+degraded+remapped+backfill_toofull, acting [3]
> pg 19.2b is active+remapped+backfill_toofull, acting [12,3]
> pg 19.6b is active+remapped+backfill_toofull, acting [12,3]
> pg 20.55 is active+remapped+backfill_toofull, acting [29,3]
> pg 24.6 is active+undersized+degraded+remapped+backfill_toofull, acting [22]
> pg 24.b is active+undersized+degraded+remapped+backfill_toofull, acting [21]
> pg 24.13 is active+undersized+degraded+remapped+backfill_toofull, acting [23]
> pg 24.16 is active+undersized+degraded+remapped+backfill_toofull, acting [21]
> pg 24.1d is active+undersized+degraded+remapped+backfill_toofull, acting [21]
> [WRN] PG_DEGRADED: Degraded data redundancy: 432856/3408880 objects degraded (12.698%), 191 pgs degraded, 181 pgs undersized
> pg 7.3f is stuck undersized for 9m, current state active+undersized+degraded, last acting [16]
> pg 7.56 is stuck undersized for 21m, current state active+undersized+degraded, last acting [13]
> pg 7.61 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
> pg 7.66 is stuck undersized for 7m, current state active+undersized+degraded, last acting [29]
> pg 7.6b is stuck undersized for 25m, current state active+undersized+degraded, last acting [21]
> pg 7.102 is active+undersized+degraded, acting [23]
> pg 7.118 is stuck undersized for 14m, current state active+undersized+degraded, last acting [23]
> pg 7.11c is stuck undersized for 9m, current state active+undersized+degraded, last acting [3]
> pg 7.133 is stuck undersized for 7m, current state active+undersized+degraded, last acting [12]
> pg 7.139 is stuck undersized for 7m, current state active+undersized+degraded, last acting [13]
> pg 7.143 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
> pg 7.155 is stuck undersized for 25m, current state active+undersized+degraded, last acting [3]
> pg 7.156 is active+undersized+degraded, acting [12]
> pg 7.15e is stuck undersized for 31m, current state active+undersized+degraded, last acting [3]
> pg 7.15f is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
> pg 7.168 is stuck undersized for 67m, current state active+undersized+degraded, last acting [22]
> pg 7.17f is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
> pg 7.180 is stuck undersized for 14m, current state active+undersized+degraded, last acting [21]
> pg 7.18e is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
> pg 7.193 is active+undersized+degraded, acting [16]
> pg 7.197 is stuck undersized for 14m, current state active+undersized+degraded, last acting [21]
> pg 7.1a6 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
> pg 7.1b7 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
> pg 7.1c6 is stuck undersized for 14m, current state active+undersized+degraded, last acting [22]
> pg 7.1ca is stuck undersized for 14m, current state active+undersized+degraded, last acting [22]
> pg 7.1d7 is stuck undersized for 9m, current state active+undersized+degraded, last acting [22]
> pg 7.1df is active+undersized+degraded, acting [21]
> pg 7.1e6 is stuck undersized for 10h, current state active+undersized+degraded, last acting [23]
> pg 7.200 is active+undersized+degraded, acting [29]
> pg 7.202 is stuck undersized for 7m, current state active+undersized+degraded, last acting [13]
> pg 7.20c is stuck undersized for 10h, current state active+undersized+degraded, last acting [16]
> pg 7.20e is stuck undersized for 47m, current state active+undersized+degraded, last acting [23]
> pg 7.20f is stuck undersized for 47m, current state active+undersized+degraded, last acting [23]
> pg 7.217 is stuck undersized for 7m, current state active+undersized+degraded, last acting [21]
> pg 15.35 is active+undersized+degraded, acting [22]
> pg 16.2a is stuck undersized for 10h, current state active+undersized+degraded, last acting [21]
> pg 19.43 is stuck undersized for 31m, current state active+undersized+degraded, last acting [23]
> pg 19.44 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
> pg 19.4e is stuck undersized for 14m, current state active+undersized+degraded, last acting [16]
> pg 19.52 is active+undersized+degraded+wait, acting [3]
> pg 19.55 is stuck undersized for 25m, current state active+undersized+degraded, last acting [23]
> pg 19.61 is stuck undersized for 25m, current state active+undersized+degraded, last acting [21]
> pg 19.72 is stuck undersized for 31m, current state active+undersized+degraded, last acting [3]
> pg 20.42 is stuck undersized for 7m, current state active+undersized+degraded, last acting [23]
> pg 20.48 is stuck undersized for 67m, current state active+undersized+degraded, last acting [16]
> pg 20.5b is stuck undersized for 21m, current state active+undersized+degraded, last acting [12]
> pg 20.5f is stuck undersized for 10h, current state active+undersized+degraded, last acting [12]
> pg 20.65 is stuck undersized for 10m, current state active+undersized+degraded, last acting [23]
> pg 20.6a is active+undersized+degraded, acting [13]
> pg 20.71 is stuck undersized for 31m, current state active+undersized+degraded, last acting [13]
> pg 20.7d is stuck undersized for 7m, current state active+undersized+degraded, last acting [29]
> [WRN] POOL_NEARFULL: 12 pool(s) nearfull
> pool '.mgr' is nearfull
> pool 'DataStore' is nearfull
> pool 'cephfs_data' is nearfull
> pool 'cephfs_metadata' is nearfull
> pool 'OS' is nearfull
> pool 'cloud' is nearfull
> pool 'DataStore_2' is nearfull
> pool 'DataStore_3' is nearfull
> pool 'MGMT' is nearfull
> pool 'DataStore_4' is nearfull
> pool 'DataStore_5' is nearfull
> pool 'fast' is nearfull
> [WRN] SLOW_OPS: 255 slow ops, oldest one blocked for 2417 sec, daemons [osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.
> root@cmt6770:~# ceph -s
>   cluster:
>     id:     9319dafb-3408-46cb-9b09-b3d381114545
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             1 MDSs report slow metadata IOs
>             mon cmt6770 is low on available space
>             9 osds down
>             1 host (3 osds) down
>             5 nearfull osd(s)
>             Reduced data availability: 866 pgs inactive, 489 pgs down, 5 pgs incomplete, 60 pgs stale
>             Low space hindering backfill (add storage if this doesn't resolve itself): 9 pgs backfill_toofull
>             Degraded data redundancy: 432856/3408880 objects degraded (12.698%), 191 pgs degraded, 181 pgs undersized
>             12 pool(s) nearfull
>             255 slow ops, oldest one blocked for 2422 sec, daemons [osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.
>
>   services:
>     mon: 2 daemons, quorum cmt6770,cmt5923 (age 70m)
>     mgr: cmt6770(active, since 3h)
>     mds: 1/1 daemons up, 1 standby
>     osd: 25 osds: 11 up (since 14s), 20 in (since 9m); 182 remapped pgs
>
>   data:
>     volumes: 0/1 healthy, 1 recovering
>     pools:   12 pools, 1589 pgs
>     objects: 1.70M objects, 6.2 TiB
>     usage:   9.1 TiB used, 6.2 TiB / 15 TiB avail
>     pgs:     28.760% pgs unknown
>              34.991% pgs not active
>              432856/3408880 objects degraded (12.698%)
>              388136/3408880 objects misplaced (11.386%)
>              466 down
>              457 unknown
>              209 active+clean
>              185 active+undersized+degraded
>              157 active+clean+remapped
>               62 stale
>               20 stale+down
>                9 active+undersized+remapped
>                6 active+undersized+degraded+remapped+backfill_toofull
>                5 incomplete
>                3 active+remapped+backfill_toofull
>                2 active+clean+scrubbing+deep
>                2 active+clean+remapped+scrubbing+deep
>                2 down+remapped
>                1 stale+creating+down
>                1 active+remapped+backfilling
>                1 active+remapped+backfill_wait
>                1 active+undersized+remapped+wait
>
>   io:
>     recovery: 12 MiB/s, 3 objects/s
>
> root@cmt6770:~# ceph osd tree
> ID   CLASS  WEIGHT    TYPE NAME         STATUS  REWEIGHT  PRI-AFF
>  -1         34.05125  root default
>  -3          5.38539      host cmt5923
>   3    ssd   1.74660          osd.3         up   0.79999  1.00000
>   8    ssd   1.81940          osd.8       down   1.00000  1.00000
>   9    ssd   1.81940          osd.9       down   1.00000  1.00000
> -15          4.40289      host cmt6461
>  24   nvme   0.90970          osd.24      down   0.79999  1.00000
>   2    ssd   1.74660          osd.2       down   1.00000  1.00000
>  17    ssd   1.74660          osd.17      down   1.00000  1.00000
>  -5          5.35616      host cmt6770
>   0    ssd   0.87329          osd.0       down   1.00000  1.00000
>   1    ssd   0.87329          osd.1       down   1.00000  1.00000
>   4    ssd   1.86299          osd.4       down         0  1.00000
>  14    ssd   0.87329          osd.14      down   1.00000  1.00000
>  15    ssd   0.87329          osd.15        up   1.00000  1.00000
>  -9          7.24838      host cmt7773
>   5   nvme   1.81940          osd.5       down         0  1.00000
>  19   nvme   1.81940          osd.19      down         0  1.00000
>   7    ssd   1.74660          osd.7       down   1.00000  1.00000
>  29    ssd   1.86299          osd.29        up   1.00000  1.00000
> -13          7.93245      host dc2943
>  22   nvme   0.90970          osd.22        up   1.00000  1.00000
>  23   nvme   0.90970          osd.23        up   1.00000  1.00000
>   6    ssd   1.74660          osd.6       down         0  1.00000
>  10    ssd   0.87329          osd.10        up   1.00000  1.00000
>  11    ssd   0.87329          osd.11        up   1.00000  1.00000
>  12    ssd   0.87329          osd.12        up   1.00000  1.00000
>  13    ssd   0.87329          osd.13        up   1.00000  1.00000
>  16    ssd   0.87329          osd.16        up   0.79999  1.00000
> -11          3.72598      host dc3658
>  20   nvme   1.86299          osd.20      down         0  1.00000
>  21   nvme   1.86299          osd.21        up   0.90002  1.00000
> root@cmt6770:~# ceph osd df
> ID  CLASS   WEIGHT  REWEIGHT     SIZE  RAW USE     DATA     OMAP     META    AVAIL   %USE   VAR  PGS  STATUS
>  3    ssd  1.74660   0.79999  1.7 TiB  1.5 TiB  1.5 TiB  228 KiB  2.9 GiB  251 GiB  85.96  1.44  214      up
>  8    ssd  1.81940   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
>  9    ssd  1.81940   1.00000  1.8 TiB  776 MiB  745 MiB    8 KiB   31 MiB  1.8 TiB   0.04     0   46    down
> 24   nvme  0.90970   0.79999      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
>  2    ssd  1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    4    down
> 17    ssd  1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
>  0    ssd  0.87329   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    3    down
>  1    ssd  0.87329   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
>  4    ssd  1.86299         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
> 14    ssd  0.87329   1.00000  894 GiB  792 MiB  752 MiB   14 KiB   40 MiB  893 GiB   0.09  0.00   25    down
> 15    ssd  0.87329   1.00000  894 GiB  232 GiB  231 GiB   14 KiB  1.4 GiB  662 GiB  25.98  0.44   83      up
>  5   nvme  1.81940         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
> 19   nvme  1.81940         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
>  7    ssd  1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
> 29    ssd  1.86299   1.00000  1.9 TiB  1.5 TiB  1.5 TiB  323 KiB  2.8 GiB  354 GiB  81.44  1.37  222      up
> 22   nvme  0.90970   1.00000  932 GiB  689 GiB  687 GiB  181 KiB  1.6 GiB  243 GiB  73.96  1.24  139      up
> 23   nvme  0.90970   1.00000  932 GiB  820 GiB  818 GiB  138 KiB  2.0 GiB  112 GiB  87.98  1.48  144      up
>  6    ssd  1.74660         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
> 10    ssd  0.87329   1.00000  894 GiB  237 GiB  235 GiB    1 KiB  1.2 GiB  658 GiB  26.46  0.44   82      up
> 11    ssd  0.87329   1.00000  894 GiB  264 GiB  263 GiB    1 KiB  1.4 GiB  630 GiB  29.54  0.50   67      up
> 12    ssd  0.87329   1.00000  894 GiB  780 GiB  778 GiB  123 KiB  1.8 GiB  114 GiB  87.26  1.46  113      up
> 13    ssd  0.87329   1.00000  894 GiB  684 GiB  682 GiB  170 KiB  1.9 GiB  210 GiB  76.48  1.28   98      up
> 16    ssd  0.87329   0.79999  894 GiB  779 GiB  777 GiB  149 KiB  1.8 GiB  116 GiB  87.06  1.46   86      up
> 20   nvme  1.86299         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
> 21   nvme  1.86299   0.90002  1.9 TiB  1.7 TiB  1.7 TiB  430 KiB  3.5 GiB  194 GiB  89.84  1.51  314      up
>                        TOTAL   15 TiB  9.1 TiB  9.1 TiB  1.7 MiB   22 GiB  6.2 TiB  59.60
> MIN/MAX VAR: 0/1.51  STDDEV: 44.89
>
>
> also osd start logs:
> May 10 15:38:32 cmt5923 systemd[1]: ceph-osd@9.service: Failed with result 'signal'.
> May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.579+0300 764caf13f880 -1 osd.8 100504 log_to_monitors true
> May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.791+0300 764c8c64b6c0 -1 log_channel(cluster) log [ERR] : 7.26a past_intervals [96946,100253) start interval does not contain the required bound [93903,100253) start
> May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.791+0300 764c8c64b6c0 -1 osd.8 pg_epoch: 100377 pg[7.26a( empty local-lis/les=0/0 n=0 ec=96946/96946 lis/c=96236/93898 les/c/f=96237/93903/91308 sis=100253) [3,1] r=-1 lpr=100376 pi=[96946,100253)/3 crt=0'0 mlcod 0'0 unknown mbc={}] PeeringState::check_past_interval_bounds 7.26a past_intervals [96946,100253) start interval does not contain the required bound [93903,100253) start
>
> We appreciate any support and guidance,
> Thanks in advance
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
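For what it's worth, the figures in the health output are internally consistent. A quick arithmetic check of the degraded/misplaced ratios from `ceph -s`, plus a comparison of the `%USE` column from `ceph osd df` against the default nearfull threshold (`mon_osd_nearfull_ratio` defaults to 0.85), reproduces exactly the five OSDs flagged under `OSD_NEARFULL`. The values below are copied from the outputs above; this is plain arithmetic, not a Ceph API call:

```python
# Degraded/misplaced object ratios reported by `ceph -s`.
degraded, misplaced, total = 432856, 388136, 3408880
print(f"degraded:  {degraded / total:.3%}")   # 12.698%
print(f"misplaced: {misplaced / total:.3%}")  # 11.386%

# %USE per up OSD, copied from the `ceph osd df` output.
use_pct = {
    "osd.3": 85.96, "osd.10": 26.46, "osd.11": 29.54, "osd.12": 87.26,
    "osd.13": 76.48, "osd.15": 25.98, "osd.16": 87.06, "osd.21": 89.84,
    "osd.22": 73.96, "osd.23": 87.98, "osd.29": 81.44,
}
NEARFULL_PCT = 85.0  # default mon_osd_nearfull_ratio is 0.85
nearfull = {osd for osd, pct in use_pct.items() if pct >= NEARFULL_PCT}
print(sorted(nearfull))  # the five OSDs listed under OSD_NEARFULL
```

The match suggests the nearfull warnings are simply a consequence of the down OSDs concentrating data on the survivors, which also explains the `backfill_toofull` PGs.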

 