Hi,

Can you provide a "ceph osd df tree" output?
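For example, run on any node with an admin keyring (the redirect to a file is only a suggestion so the column alignment survives the mail client; the file name is arbitrary):

    # per-OSD utilisation, grouped by the CRUSH tree (root/host)
    ceph osd df tree

    # optionally capture it to a file and attach/paste that
    ceph osd df tree > /tmp/osd-df-tree.txt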
Regards
Michel

Sent from my mobile phone

> On 10. May 2025, at 14:56, Senol COLAK <se...@kubedo.com> wrote:
>
> Hello,
>
> After upgrading from ceph reef 18.2.6 to ceph squid 19.2.1 I restarted the
> osds and they remained down. The events contain the following records:
>
> root@cmt6770:~# ceph health detail
> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; mon cmt6770 is low on available space; 9 osds down; 1 host (3 osds) down; 5 nearfull osd(s); Reduced data availability: 866 pgs inactive, 489 pgs down, 5 pgs incomplete, 60 pgs stale; Low space hindering backfill (add storage if this doesn't resolve itself): 9 pgs backfill_toofull; Degraded data redundancy: 432856/3408880 objects degraded (12.698%), 191 pgs degraded, 181 pgs undersized; 12 pool(s) nearfull; 255 slow ops, oldest one blocked for 2417 sec, daemons [osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.
> [WRN] FS_DEGRADED: 1 filesystem is degraded
>     fs cephfs is degraded
> [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
>     mds.cmt5923(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 38201 secs
> [WRN] MON_DISK_LOW: mon cmt6770 is low on available space
>     mon.cmt6770 has 28% avail
> [WRN] OSD_DOWN: 9 osds down
>     osd.0 (root=default,host=cmt6770) is down
>     osd.1 (root=default,host=cmt6770) is down
>     osd.2 (root=default,host=cmt6461) is down
>     osd.7 (root=default,host=cmt7773) is down
>     osd.8 (root=default,host=cmt5923) is down
>     osd.9 (root=default,host=cmt5923) is down
>     osd.14 (root=default,host=cmt6770) is down
>     osd.17 (root=default,host=cmt6461) is down
>     osd.24 (root=default,host=cmt6461) is down
> [WRN] OSD_HOST_DOWN: 1 host (3 osds) down
>     host cmt6461 (root=default) (3 osds) is down
> [WRN] OSD_NEARFULL: 5 nearfull osd(s)
>     osd.3 is near full
>     osd.12 is near full
>     osd.16 is near full
>     osd.21 is near full
>     osd.23 is near full
> [WRN] PG_AVAILABILITY: Reduced data availability: 866 pgs inactive, 489 pgs down, 5 pgs incomplete, 60 pgs stale
>     pg 7.1c5 is down, acting [3]
>     pg 7.1c7 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1c8 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1cb is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1cd is down, acting [15,3]
>     pg 7.1ce is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1cf is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1d0 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1d1 is down, acting [29,13]
>     pg 7.1d2 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1d3 is down, acting [23]
>     pg 7.1d4 is down, acting [16]
>     pg 7.1d5 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1d6 is down, acting [3]
>     pg 7.1d9 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1da is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1e0 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1e1 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1e2 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1e4 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1e5 is down, acting [12,29]
>     pg 7.1e7 is stuck stale for 7m, current state stale+down, last acting [9]
>     pg 7.1e8 is down, acting [12]
>     pg 7.1e9 is stuck stale for 31m, current state stale, last acting [2]
>     pg 7.1eb is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1ed is down, acting [3]
>     pg 7.1ee is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1ef is down, acting [12]
>     pg 7.1f0 is down, acting [10]
>     pg 7.1f1 is down, acting [12,29]
>     pg 7.1f2 is down, acting [16]
>     pg 7.1f3 is stuck stale for 7m, current state stale, last acting [9]
>     pg 7.1f4 is down, acting [22]
>     pg 7.1f5 is down, acting [22]
>     pg 7.1f8 is down, acting [29]
>     pg 7.1f9 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1fb is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.1fc is stuck stale for 15m, current state stale+down, last acting [29]
>     pg 7.1fd is down, acting [3]
>     pg 7.1fe is down, acting [3,15]
>     pg 7.1ff is down, acting [3]
>     pg 7.201 is down, acting [12]
>     pg 7.204 is down, acting [10]
>     pg 7.205 is down, acting [13]
>     pg 7.207 is down, acting [11]
>     pg 7.20a is down, acting [3]
>     pg 7.20b is down, acting [22]
>     pg 7.20d is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.210 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.211 is stuck inactive for 3h, current state unknown, last acting []
>     pg 7.21b is down, acting [16]
> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 9 pgs backfill_toofull
>     pg 7.20 is active+undersized+degraded+remapped+backfill_toofull, acting [3]
>     pg 19.2b is active+remapped+backfill_toofull, acting [12,3]
>     pg 19.6b is active+remapped+backfill_toofull, acting [12,3]
>     pg 20.55 is active+remapped+backfill_toofull, acting [29,3]
>     pg 24.6 is active+undersized+degraded+remapped+backfill_toofull, acting [22]
>     pg 24.b is active+undersized+degraded+remapped+backfill_toofull, acting [21]
>     pg 24.13 is active+undersized+degraded+remapped+backfill_toofull, acting [23]
>     pg 24.16 is active+undersized+degraded+remapped+backfill_toofull, acting [21]
>     pg 24.1d is active+undersized+degraded+remapped+backfill_toofull, acting [21]
> [WRN] PG_DEGRADED: Degraded data redundancy: 432856/3408880 objects degraded (12.698%), 191 pgs degraded, 181 pgs undersized
>     pg 7.3f is stuck undersized for 9m, current state active+undersized+degraded, last acting [16]
>     pg 7.56 is stuck undersized for 21m, current state active+undersized+degraded, last acting [13]
>     pg 7.61 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
>     pg 7.66 is stuck undersized for 7m, current state active+undersized+degraded, last acting [29]
>     pg 7.6b is stuck undersized for 25m, current state active+undersized+degraded, last acting [21]
>     pg 7.102 is active+undersized+degraded, acting [23]
>     pg 7.118 is stuck undersized for 14m, current state active+undersized+degraded, last acting [23]
>     pg 7.11c is stuck undersized for 9m, current state active+undersized+degraded, last acting [3]
>     pg 7.133 is stuck undersized for 7m, current state active+undersized+degraded, last acting [12]
>     pg 7.139 is stuck undersized for 7m, current state active+undersized+degraded, last acting [13]
>     pg 7.143 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
>     pg 7.155 is stuck undersized for 25m, current state active+undersized+degraded, last acting [3]
>     pg 7.156 is active+undersized+degraded, acting [12]
>     pg 7.15e is stuck undersized for 31m, current state active+undersized+degraded, last acting [3]
>     pg 7.15f is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
>     pg 7.168 is stuck undersized for 67m, current state active+undersized+degraded, last acting [22]
>     pg 7.17f is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
>     pg 7.180 is stuck undersized for 14m, current state active+undersized+degraded, last acting [21]
>     pg 7.18e is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
>     pg 7.193 is active+undersized+degraded, acting [16]
>     pg 7.197 is stuck undersized for 14m, current state active+undersized+degraded, last acting [21]
>     pg 7.1a6 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
>     pg 7.1b7 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
>     pg 7.1c6 is stuck undersized for 14m, current state active+undersized+degraded, last acting [22]
>     pg 7.1ca is stuck undersized for 14m, current state active+undersized+degraded, last acting [22]
>     pg 7.1d7 is stuck undersized for 9m, current state active+undersized+degraded, last acting [22]
>     pg 7.1df is active+undersized+degraded, acting [21]
>     pg 7.1e6 is stuck undersized for 10h, current state active+undersized+degraded, last acting [23]
>     pg 7.200 is active+undersized+degraded, acting [29]
>     pg 7.202 is stuck undersized for 7m, current state active+undersized+degraded, last acting [13]
>     pg 7.20c is stuck undersized for 10h, current state active+undersized+degraded, last acting [16]
>     pg 7.20e is stuck undersized for 47m, current state active+undersized+degraded, last acting [23]
>     pg 7.20f is stuck undersized for 47m, current state active+undersized+degraded, last acting [23]
>     pg 7.217 is stuck undersized for 7m, current state active+undersized+degraded, last acting [21]
>     pg 15.35 is active+undersized+degraded, acting [22]
>     pg 16.2a is stuck undersized for 10h, current state active+undersized+degraded, last acting [21]
>     pg 19.43 is stuck undersized for 31m, current state active+undersized+degraded, last acting [23]
>     pg 19.44 is stuck undersized for 8m, current state active+undersized+degraded, last acting [29]
>     pg 19.4e is stuck undersized for 14m, current state active+undersized+degraded, last acting [16]
>     pg 19.52 is active+undersized+degraded+wait, acting [3]
>     pg 19.55 is stuck undersized for 25m, current state active+undersized+degraded, last acting [23]
>     pg 19.61 is stuck undersized for 25m, current state active+undersized+degraded, last acting [21]
>     pg 19.72 is stuck undersized for 31m, current state active+undersized+degraded, last acting [3]
>     pg 20.42 is stuck undersized for 7m, current state active+undersized+degraded, last acting [23]
>     pg 20.48 is stuck undersized for 67m, current state active+undersized+degraded, last acting [16]
>     pg 20.5b is stuck undersized for 21m, current state active+undersized+degraded, last acting [12]
>     pg 20.5f is stuck undersized for 10h, current state active+undersized+degraded, last acting [12]
>     pg 20.65 is stuck undersized for 10m, current state active+undersized+degraded, last acting [23]
>     pg 20.6a is active+undersized+degraded, acting [13]
>     pg 20.71 is stuck undersized for 31m, current state active+undersized+degraded, last acting [13]
>     pg 20.7d is stuck undersized for 7m, current state active+undersized+degraded, last acting [29]
> [WRN] POOL_NEARFULL: 12 pool(s) nearfull
>     pool '.mgr' is nearfull
>     pool 'DataStore' is nearfull
>     pool 'cephfs_data' is nearfull
>     pool 'cephfs_metadata' is nearfull
>     pool 'OS' is nearfull
>     pool 'cloud' is nearfull
>     pool 'DataStore_2' is nearfull
>     pool 'DataStore_3' is nearfull
>     pool 'MGMT' is nearfull
>     pool 'DataStore_4' is nearfull
>     pool 'DataStore_5' is nearfull
>     pool 'fast' is nearfull
> [WRN] SLOW_OPS: 255 slow ops, oldest one blocked for 2417 sec, daemons [osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.
>
> root@cmt6770:~# ceph -s
>   cluster:
>     id:     9319dafb-3408-46cb-9b09-b3d381114545
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             1 MDSs report slow metadata IOs
>             mon cmt6770 is low on available space
>             9 osds down
>             1 host (3 osds) down
>             5 nearfull osd(s)
>             Reduced data availability: 866 pgs inactive, 489 pgs down, 5 pgs incomplete, 60 pgs stale
>             Low space hindering backfill (add storage if this doesn't resolve itself): 9 pgs backfill_toofull
>             Degraded data redundancy: 432856/3408880 objects degraded (12.698%), 191 pgs degraded, 181 pgs undersized
>             12 pool(s) nearfull
>             255 slow ops, oldest one blocked for 2422 sec, daemons [osd.10,osd.12,osd.21,osd.22,osd.23] have slow ops.
>
>   services:
>     mon: 2 daemons, quorum cmt6770,cmt5923 (age 70m)
>     mgr: cmt6770(active, since 3h)
>     mds: 1/1 daemons up, 1 standby
>     osd: 25 osds: 11 up (since 14s), 20 in (since 9m); 182 remapped pgs
>
>   data:
>     volumes: 0/1 healthy, 1 recovering
>     pools:   12 pools, 1589 pgs
>     objects: 1.70M objects, 6.2 TiB
>     usage:   9.1 TiB used, 6.2 TiB / 15 TiB avail
>     pgs:     28.760% pgs unknown
>              34.991% pgs not active
>              432856/3408880 objects degraded (12.698%)
>              388136/3408880 objects misplaced (11.386%)
>              466 down
>              457 unknown
>              209 active+clean
>              185 active+undersized+degraded
>              157 active+clean+remapped
>              62  stale
>              20  stale+down
>              9   active+undersized+remapped
>              6   active+undersized+degraded+remapped+backfill_toofull
>              5   incomplete
>              3   active+remapped+backfill_toofull
>              2   active+clean+scrubbing+deep
>              2   active+clean+remapped+scrubbing+deep
>              2   down+remapped
>              1   stale+creating+down
>              1   active+remapped+backfilling
>              1   active+remapped+backfill_wait
>              1   active+undersized+remapped+wait
>
>   io:
>     recovery: 12 MiB/s, 3 objects/s
>
> root@cmt6770:~# ceph osd tree
> ID   CLASS  WEIGHT    TYPE NAME         STATUS  REWEIGHT  PRI-AFF
>  -1         34.05125  root default
>  -3          5.38539      host cmt5923
>   3    ssd   1.74660          osd.3         up   0.79999  1.00000
>   8    ssd   1.81940          osd.8       down   1.00000  1.00000
>   9    ssd   1.81940          osd.9       down   1.00000  1.00000
> -15          4.40289      host cmt6461
>  24   nvme   0.90970          osd.24      down   0.79999  1.00000
>   2    ssd   1.74660          osd.2       down   1.00000  1.00000
>  17    ssd   1.74660          osd.17      down   1.00000  1.00000
>  -5          5.35616      host cmt6770
>   0    ssd   0.87329          osd.0       down   1.00000  1.00000
>   1    ssd   0.87329          osd.1       down   1.00000  1.00000
>   4    ssd   1.86299          osd.4       down         0  1.00000
>  14    ssd   0.87329          osd.14      down   1.00000  1.00000
>  15    ssd   0.87329          osd.15        up   1.00000  1.00000
>  -9          7.24838      host cmt7773
>   5   nvme   1.81940          osd.5       down         0  1.00000
>  19   nvme   1.81940          osd.19      down         0  1.00000
>   7    ssd   1.74660          osd.7       down   1.00000  1.00000
>  29    ssd   1.86299          osd.29        up   1.00000  1.00000
> -13          7.93245      host dc2943
>  22   nvme   0.90970          osd.22        up   1.00000  1.00000
>  23   nvme   0.90970          osd.23        up   1.00000  1.00000
>   6    ssd   1.74660          osd.6       down         0  1.00000
>  10    ssd   0.87329          osd.10        up   1.00000  1.00000
>  11    ssd   0.87329          osd.11        up   1.00000  1.00000
>  12    ssd   0.87329          osd.12        up   1.00000  1.00000
>  13    ssd   0.87329          osd.13        up   1.00000  1.00000
>  16    ssd   0.87329          osd.16        up   0.79999  1.00000
> -11          3.72598      host dc3658
>  20   nvme   1.86299          osd.20      down         0  1.00000
>  21   nvme   1.86299          osd.21        up   0.90002  1.00000
>
> root@cmt6770:~# ceph osd df
> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>  3  ssd    1.74660   0.79999  1.7 TiB  1.5 TiB  1.5 TiB  228 KiB  2.9 GiB  251 GiB  85.96  1.44  214    up
>  8  ssd    1.81940   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
>  9  ssd    1.81940   1.00000  1.8 TiB  776 MiB  745 MiB    8 KiB   31 MiB  1.8 TiB   0.04     0    46  down
> 24  nvme   0.90970   0.79999      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
>  2  ssd    1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0     4  down
> 17  ssd    1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
>  0  ssd    0.87329   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0     3  down
>  1  ssd    0.87329   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
>  4  ssd    1.86299         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
> 14  ssd    0.87329   1.00000  894 GiB  792 MiB  752 MiB   14 KiB   40 MiB  893 GiB   0.09  0.00    25  down
> 15  ssd    0.87329   1.00000  894 GiB  232 GiB  231 GiB   14 KiB  1.4 GiB  662 GiB  25.98  0.44    83    up
>  5  nvme   1.81940         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
> 19  nvme   1.81940         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
>  7  ssd    1.74660   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
> 29  ssd    1.86299   1.00000  1.9 TiB  1.5 TiB  1.5 TiB  323 KiB  2.8 GiB  354 GiB  81.44  1.37   222    up
> 22  nvme   0.90970   1.00000  932 GiB  689 GiB  687 GiB  181 KiB  1.6 GiB  243 GiB  73.96  1.24   139    up
> 23  nvme   0.90970   1.00000  932 GiB  820 GiB  818 GiB  138 KiB  2.0 GiB  112 GiB  87.98  1.48   144    up
>  6  ssd    1.74660         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
> 10  ssd    0.87329   1.00000  894 GiB  237 GiB  235 GiB    1 KiB  1.2 GiB  658 GiB  26.46  0.44    82    up
> 11  ssd    0.87329   1.00000  894 GiB  264 GiB  263 GiB    1 KiB  1.4 GiB  630 GiB  29.54  0.50    67    up
> 12  ssd    0.87329   1.00000  894 GiB  780 GiB  778 GiB  123 KiB  1.8 GiB  114 GiB  87.26  1.46   113    up
> 13  ssd    0.87329   1.00000  894 GiB  684 GiB  682 GiB  170 KiB  1.9 GiB  210 GiB  76.48  1.28    98    up
> 16  ssd    0.87329   0.79999  894 GiB  779 GiB  777 GiB  149 KiB  1.8 GiB  116 GiB  87.06  1.46    86    up
> 20  nvme   1.86299         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  down
> 21  nvme   1.86299   0.90002  1.9 TiB  1.7 TiB  1.7 TiB  430 KiB  3.5 GiB  194 GiB  89.84  1.51   314    up
>                        TOTAL   15 TiB  9.1 TiB  9.1 TiB  1.7 MiB   22 GiB  6.2 TiB  59.60
> MIN/MAX VAR: 0/1.51  STDDEV: 44.89
>
> also osd start logs:
> May 10 15:38:32 cmt5923 systemd[1]: ceph-osd@9.service: Failed with result 'signal'.
> May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.579+0300 764caf13f880 -1 osd.8 100504 log_to_monitors true
> May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.791+0300 764c8c64b6c0 -1 log_channel(cluster) log [ERR] : 7.26a past_intervals [96946,100253) start interval does not contain the required bound [93903,100253) start
> May 10 15:38:37 cmt5923 ceph-osd[2383902]: 2025-05-10T15:38:37.791+0300 764c8c64b6c0 -1 osd.8 pg_epoch: 100377 pg[7.26a( empty local-lis/les=0/0 n=0 ec=96946/96946 lis/c=96236/93898 les/c/f=96237/93903/91308 sis=100253) [3,1] r=-1 lpr=100376 pi=[96946,100253)/3 crt=0'0 mlcod 0'0 unknown mbc={}] PeeringState::check_past_interval_bounds 7.26a past_intervals [96946,100253) start interval does not contain the required bound [93903,100253) start
>
> We appreciate any support and guidance,
> Thanks in advance
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io