Hello,

After upgrading from Ceph Reef 18.2.6 to Ceph Squid 19.2.1, I restarted the OSDs,
but they remained down.
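
For reference, the OSDs were restarted through their systemd units, roughly like
this (shown for cmt5923, which hosts osd.8 and osd.9 per `ceph osd tree`; the other
hosts were handled the same way, so this is a sketch rather than an exact transcript):

    # on cmt5923, after the packages on the host were upgraded
    systemctl restart ceph-osd@8.service ceph-osd@9.service
    # then check whether the daemons stayed up
    systemctl status ceph-osd@8.service ceph-osd@9.service

The cluster health, status, and OSD usage currently look like this: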

root@cmt6770:~# ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; mon 
cmt6770 is low on available space; 9 osds down; 1 host (3 osds) down; 5 
nearfull osd(s); Reduced data availability: 866 pgs inactive, 489 pgs down, 5 
pgs incomplete, 60 pgs stale; Low space hindering backfill (add storage if this 
doesn't resolve itself): 9 pgs backfill_toofull; Degraded data redundancy: 
432856/3408880 objects degraded (12.698%), 191 pgs degraded, 181 pgs 
undersized; 12 pool(s) nearfull; 255 slow ops, oldest one blocked for 2417 sec, 
daemons [osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.cmt5923(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest 
blocked for 38201 secs
[WRN] MON_DISK_LOW: mon cmt6770 is low on available space
    mon.cmt6770 has 28% avail
[WRN] OSD_DOWN: 9 osds down
    osd.0 (root=default,host=cmt6770) is down
    osd.1 (root=default,host=cmt6770) is down
    osd.2 (root=default,host=cmt6461) is down
    osd.7 (root=default,host=cmt7773) is down
    osd.8 (root=default,host=cmt5923) is down
    osd.9 (root=default,host=cmt5923) is down
    osd.14 (root=default,host=cmt6770) is down
    osd.17 (root=default,host=cmt6461) is down
    osd.24 (root=default,host=cmt6461) is down
[WRN] OSD_HOST_DOWN: 1 host (3 osds) down
    host cmt6461 (root=default) (3 osds) is down
[WRN] OSD_NEARFULL: 5 nearfull osd(s)
    osd.3 is near full
    osd.12 is near full
    osd.16 is near full
    osd.21 is near full
    osd.23 is near full
[WRN] PG_AVAILABILITY: Reduced data availability: 866 pgs inactive, 489 pgs 
down, 5 pgs incomplete, 60 pgs stale
    pg 7.1c5 is down, acting [3]
    pg 7.1c7 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1c8 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1cb is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1cd is down, acting [15,3]
    pg 7.1ce is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1cf is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1d0 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1d1 is down, acting [29,13]
    pg 7.1d2 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1d3 is down, acting [23]
    pg 7.1d4 is down, acting [16]
    pg 7.1d5 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1d6 is down, acting [3]
    pg 7.1d9 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1da is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1e0 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1e1 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1e2 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1e4 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1e5 is down, acting [12,29]
    pg 7.1e7 is stuck stale for 7m, current state stale+down, last acting [9]
    pg 7.1e8 is down, acting [12]
    pg 7.1e9 is stuck stale for 31m, current state stale, last acting [2]
    pg 7.1eb is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1ed is down, acting [3]
    pg 7.1ee is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1ef is down, acting [12]
    pg 7.1f0 is down, acting [10]
    pg 7.1f1 is down, acting [12,29]
    pg 7.1f2 is down, acting [16]
    pg 7.1f3 is stuck stale for 7m, current state stale, last acting [9]
    pg 7.1f4 is down, acting [22]
    pg 7.1f5 is down, acting [22]
    pg 7.1f8 is down, acting [29]
    pg 7.1f9 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1fb is stuck inactive for 3h, current state unknown, last acting []
    pg 7.1fc is stuck stale for 15m, current state stale+down, last acting [29]
    pg 7.1fd is down, acting [3]
    pg 7.1fe is down, acting [3,15]
    pg 7.1ff is down, acting [3]
    pg 7.201 is down, acting [12]
    pg 7.204 is down, acting [10]
    pg 7.205 is down, acting [13]
    pg 7.207 is down, acting [11]
    pg 7.20a is down, acting [3]
    pg 7.20b is down, acting [22]
    pg 7.20d is stuck inactive for 3h, current state unknown, last acting []
    pg 7.210 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.211 is stuck inactive for 3h, current state unknown, last acting []
    pg 7.21b is down, acting [16]
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this 
doesn't resolve itself): 9 pgs backfill_toofull
    pg 7.20 is active+undersized+degraded+remapped+backfill_toofull, acting [3]
    pg 19.2b is active+remapped+backfill_toofull, acting [12,3]
    pg 19.6b is active+remapped+backfill_toofull, acting [12,3]
    pg 20.55 is active+remapped+backfill_toofull, acting [29,3]
    pg 24.6 is active+undersized+degraded+remapped+backfill_toofull, acting [22]
    pg 24.b is active+undersized+degraded+remapped+backfill_toofull, acting [21]
    pg 24.13 is active+undersized+degraded+remapped+backfill_toofull, acting 
[23]
    pg 24.16 is active+undersized+degraded+remapped+backfill_toofull, acting 
[21]
    pg 24.1d is active+undersized+degraded+remapped+backfill_toofull, acting 
[21]
[WRN] PG_DEGRADED: Degraded data redundancy: 432856/3408880 objects degraded 
(12.698%), 191 pgs degraded, 181 pgs undersized
    pg 7.3f is stuck undersized for 9m, current state 
active+undersized+degraded, last acting [16]
    pg 7.56 is stuck undersized for 21m, current state 
active+undersized+degraded, last acting [13]
    pg 7.61 is stuck undersized for 8m, current state 
active+undersized+degraded, last acting [29]
    pg 7.66 is stuck undersized for 7m, current state 
active+undersized+degraded, last acting [29]
    pg 7.6b is stuck undersized for 25m, current state 
active+undersized+degraded, last acting [21]
    pg 7.102 is active+undersized+degraded, acting [23]
    pg 7.118 is stuck undersized for 14m, current state 
active+undersized+degraded, last acting [23]
    pg 7.11c is stuck undersized for 9m, current state 
active+undersized+degraded, last acting [3]
    pg 7.133 is stuck undersized for 7m, current state 
active+undersized+degraded, last acting [12]
    pg 7.139 is stuck undersized for 7m, current state 
active+undersized+degraded, last acting [13]
    pg 7.143 is stuck undersized for 8m, current state 
active+undersized+degraded, last acting [29]
    pg 7.155 is stuck undersized for 25m, current state 
active+undersized+degraded, last acting [3]
    pg 7.156 is active+undersized+degraded, acting [12]
    pg 7.15e is stuck undersized for 31m, current state 
active+undersized+degraded, last acting [3]
    pg 7.15f is stuck undersized for 8m, current state 
active+undersized+degraded, last acting [29]
    pg 7.168 is stuck undersized for 67m, current state 
active+undersized+degraded, last acting [22]
    pg 7.17f is stuck undersized for 8m, current state 
active+undersized+degraded, last acting [29]
    pg 7.180 is stuck undersized for 14m, current state 
active+undersized+degraded, last acting [21]
    pg 7.18e is stuck undersized for 8m, current state 
active+undersized+degraded, last acting [29]
    pg 7.193 is active+undersized+degraded, acting [16]
    pg 7.197 is stuck undersized for 14m, current state 
active+undersized+degraded, last acting [21]
    pg 7.1a6 is stuck undersized for 8m, current state 
active+undersized+degraded, last acting [29]
    pg 7.1b7 is stuck undersized for 8m, current state 
active+undersized+degraded, last acting [29]
    pg 7.1c6 is stuck undersized for 14m, current state 
active+undersized+degraded, last acting [22]
    pg 7.1ca is stuck undersized for 14m, current state 
active+undersized+degraded, last acting [22]
    pg 7.1d7 is stuck undersized for 9m, current state 
active+undersized+degraded, last acting [22]
    pg 7.1df is active+undersized+degraded, acting [21]
    pg 7.1e6 is stuck undersized for 10h, current state 
active+undersized+degraded, last acting [23]
    pg 7.200 is active+undersized+degraded, acting [29]
    pg 7.202 is stuck undersized for 7m, current state 
active+undersized+degraded, last acting [13]
    pg 7.20c is stuck undersized for 10h, current state 
active+undersized+degraded, last acting [16]
    pg 7.20e is stuck undersized for 47m, current state 
active+undersized+degraded, last acting [23]
    pg 7.20f is stuck undersized for 47m, current state 
active+undersized+degraded, last acting [23]
    pg 7.217 is stuck undersized for 7m, current state 
active+undersized+degraded, last acting [21]
    pg 15.35 is active+undersized+degraded, acting [22]
    pg 16.2a is stuck undersized for 10h, current state 
active+undersized+degraded, last acting [21]
    pg 19.43 is stuck undersized for 31m, current state 
active+undersized+degraded, last acting [23]
    pg 19.44 is stuck undersized for 8m, current state 
active+undersized+degraded, last acting [29]
    pg 19.4e is stuck undersized for 14m, current state 
active+undersized+degraded, last acting [16]
    pg 19.52 is active+undersized+degraded+wait, acting [3]
    pg 19.55 is stuck undersized for 25m, current state 
active+undersized+degraded, last acting [23]
    pg 19.61 is stuck undersized for 25m, current state 
active+undersized+degraded, last acting [21]
    pg 19.72 is stuck undersized for 31m, current state 
active+undersized+degraded, last acting [3]
    pg 20.42 is stuck undersized for 7m, current state 
active+undersized+degraded, last acting [23]
    pg 20.48 is stuck undersized for 67m, current state 
active+undersized+degraded, last acting [16]
    pg 20.5b is stuck undersized for 21m, current state 
active+undersized+degraded, last acting [12]
    pg 20.5f is stuck undersized for 10h, current state 
active+undersized+degraded, last acting [12]
    pg 20.65 is stuck undersized for 10m, current state 
active+undersized+degraded, last acting [23]
    pg 20.6a is active+undersized+degraded, acting [13]
    pg 20.71 is stuck undersized for 31m, current state 
active+undersized+degraded, last acting [13]
    pg 20.7d is stuck undersized for 7m, current state 
active+undersized+degraded, last acting [29]
[WRN] POOL_NEARFULL: 12 pool(s) nearfull
    pool '.mgr' is nearfull
    pool 'DataStore' is nearfull
    pool 'cephfs_data' is nearfull
    pool 'cephfs_metadata' is nearfull
    pool 'OS' is nearfull
    pool 'cloud' is nearfull
    pool 'DataStore_2' is nearfull
    pool 'DataStore_3' is nearfull
    pool 'MGMT' is nearfull
    pool 'DataStore_4' is nearfull
    pool 'DataStore_5' is nearfull
    pool 'fast' is nearfull
[WRN] SLOW_OPS: 255 slow ops, oldest one blocked for 2417 sec, daemons 
[osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.
root@cmt6770:~# ceph -s
  cluster:
    id:     9319dafb-3408-46cb-9b09-b3d381114545
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            mon cmt6770 is low on available space
            9 osds down
            1 host (3 osds) down
            5 nearfull osd(s)
            Reduced data availability: 866 pgs inactive, 489 pgs down, 5 pgs 
incomplete, 60 pgs stale
            Low space hindering backfill (add storage if this doesn't resolve 
itself): 9 pgs backfill_toofull
            Degraded data redundancy: 432856/3408880 objects degraded 
(12.698%), 191 pgs degraded, 181 pgs undersized
            12 pool(s) nearfull
            255 slow ops, oldest one blocked for 2422 sec, daemons 
[osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.

  services:
    mon: 2 daemons, quorum cmt6770,cmt5923 (age 70m)
    mgr: cmt6770(active, since 3h)
    mds: 1/1 daemons up, 1 standby
    osd: 25 osds: 11 up (since 14s), 20 in (since 9m); 182 remapped pgs

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   12 pools, 1589 pgs
    objects: 1.70M objects, 6.2 TiB
    usage:   9.1 TiB used, 6.2 TiB / 15 TiB avail
    pgs:     28.760% pgs unknown
             34.991% pgs not active
             432856/3408880 objects degraded (12.698%)
             388136/3408880 objects misplaced (11.386%)
             466 down
             457 unknown
             209 active+clean
             185 active+undersized+degraded
             157 active+clean+remapped
             62  stale
             20  stale+down
             9   active+undersized+remapped
             6   active+undersized+degraded+remapped+backfill_toofull
             5   incomplete
             3   active+remapped+backfill_toofull
             2   active+clean+scrubbing+deep
             2   active+clean+remapped+scrubbing+deep
             2   down+remapped
             1   stale+creating+down
             1   active+remapped+backfilling
             1   active+remapped+backfill_wait
             1   active+undersized+remapped+wait

  io:
    recovery: 12 MiB/s, 3 objects/s

root@cmt6770:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME         STATUS  REWEIGHT  PRI-AFF
 -1         34.05125  root default
 -3          5.38539      host cmt5923
  3    ssd   1.74660          osd.3         up   0.79999  1.00000
  8    ssd   1.81940          osd.8       down   1.00000  1.00000
  9    ssd   1.81940          osd.9       down   1.00000  1.00000
-15          4.40289      host cmt6461
 24   nvme   0.90970          osd.24      down   0.79999  1.00000
  2    ssd   1.74660          osd.2       down   1.00000  1.00000
 17    ssd   1.74660          osd.17      down   1.00000  1.00000
 -5          5.35616      host cmt6770
  0    ssd   0.87329          osd.0       down   1.00000  1.00000
  1    ssd   0.87329          osd.1       down   1.00000  1.00000
  4    ssd   1.86299          osd.4       down         0  1.00000
 14    ssd   0.87329          osd.14      down   1.00000  1.00000
 15    ssd   0.87329          osd.15        up   1.00000  1.00000
 -9          7.24838      host cmt7773
  5   nvme   1.81940          osd.5       down         0  1.00000
 19   nvme   1.81940          osd.19      down         0  1.00000
  7    ssd   1.74660          osd.7       down   1.00000  1.00000
 29    ssd   1.86299          osd.29        up   1.00000  1.00000
-13          7.93245      host dc2943
 22   nvme   0.90970          osd.22        up   1.00000  1.00000
 23   nvme   0.90970          osd.23        up   1.00000  1.00000
  6    ssd   1.74660          osd.6       down         0  1.00000
 10    ssd   0.87329          osd.10        up   1.00000  1.00000
 11    ssd   0.87329          osd.11        up   1.00000  1.00000
 12    ssd   0.87329          osd.12        up   1.00000  1.00000
 13    ssd   0.87329          osd.13        up   1.00000  1.00000
 16    ssd   0.87329          osd.16        up   0.79999  1.00000
-11          3.72598      host dc3658
 20   nvme   1.86299          osd.20      down         0  1.00000
 21   nvme   1.86299          osd.21        up   0.90002  1.00000
root@cmt6770:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3    ssd  1.74660   0.79999  1.7 TiB  1.5 TiB  1.5 TiB  228 KiB  2.9 GiB  251 GiB  85.96  1.44  214      up
 8    ssd  1.81940   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
 9    ssd  1.81940   1.00000  1.8 TiB  776 MiB  745 MiB    8 KiB   31 MiB  1.8 TiB   0.04     0   46    down
24   nvme  0.90970   0.79999      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
 2    ssd  1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    4    down
17    ssd  1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
 0    ssd  0.87329   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    3    down
 1    ssd  0.87329   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
 4    ssd  1.86299         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
14    ssd  0.87329   1.00000  894 GiB  792 MiB  752 MiB   14 KiB   40 MiB  893 GiB   0.09  0.00   25    down
15    ssd  0.87329   1.00000  894 GiB  232 GiB  231 GiB   14 KiB  1.4 GiB  662 GiB  25.98  0.44   83      up
 5   nvme  1.81940         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
19   nvme  1.81940         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
 7    ssd  1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
29    ssd  1.86299   1.00000  1.9 TiB  1.5 TiB  1.5 TiB  323 KiB  2.8 GiB  354 GiB  81.44  1.37  222      up
22   nvme  0.90970   1.00000  932 GiB  689 GiB  687 GiB  181 KiB  1.6 GiB  243 GiB  73.96  1.24  139      up
23   nvme  0.90970   1.00000  932 GiB  820 GiB  818 GiB  138 KiB  2.0 GiB  112 GiB  87.98  1.48  144      up
 6    ssd  1.74660         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
10    ssd  0.87329   1.00000  894 GiB  237 GiB  235 GiB    1 KiB  1.2 GiB  658 GiB  26.46  0.44   82      up
11    ssd  0.87329   1.00000  894 GiB  264 GiB  263 GiB    1 KiB  1.4 GiB  630 GiB  29.54  0.50   67      up
12    ssd  0.87329   1.00000  894 GiB  780 GiB  778 GiB  123 KiB  1.8 GiB  114 GiB  87.26  1.46  113      up
13    ssd  0.87329   1.00000  894 GiB  684 GiB  682 GiB  170 KiB  1.9 GiB  210 GiB  76.48  1.28   98      up
16    ssd  0.87329   0.79999  894 GiB  779 GiB  777 GiB  149 KiB  1.8 GiB  116 GiB  87.06  1.46   86      up
20   nvme  1.86299         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down
21   nvme  1.86299   0.90002  1.9 TiB  1.7 TiB  1.7 TiB  430 KiB  3.5 GiB  194 GiB  89.84  1.51  314      up
                       TOTAL   15 TiB  9.1 TiB  9.1 TiB  1.7 MiB   22 GiB  6.2 TiB  59.60
MIN/MAX VAR: 0/1.51  STDDEV: 44.89


Here are the OSD startup logs as well:
May 10 15:38:32 cmt5923 systemd[1]: ceph-osd@9.service: Failed with result 'signal'.
May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.579+0300 764caf13f880 -1 osd.8 100504 log_to_monitors true
May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.791+0300 764c8c64b6c0 -1 log_channel(cluster) log [ERR] : 7.26a past_intervals [96946,100253) start interval does not contain the required bound [93903,100253) start
May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.791+0300 764c8c64b6c0 -1 osd.8 pg_epoch: 100377 pg[7.26a( empty local-lis/les=0/0 n=0 ec=96946/96946 lis/c=96236/93898 les/c/f=96237/93903/91308 sis=100253) [3,1] r=-1 lpr=100376 pi=[96946,100253)/3 crt=0'0 mlcod 0'0 unknown mbc={}] PeeringState::check_past_interval_bounds 7.26a past_intervals [96946,100253) start interval does not contain the required bound [93903,100253) start
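
The excerpt above is from the journal on cmt5923. If a longer window is useful, I
can pull the full startup logs with something like the following (assuming the
standard package-based layout, not cephadm; unit names and log paths may differ):

    journalctl -u ceph-osd@8.service -u ceph-osd@9.service --since "2025-05-10 15:30"
    # or the on-disk OSD log, if file logging is enabled
    less /var/log/ceph/ceph-osd.8.log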

We appreciate any support and guidance.
Thanks in advance.