Dear list,

Our ceph cluster (ceph version 0.87) is stuck in a warning state with some OSDs out of their original bucket:

     health HEALTH_WARN 1097 pgs degraded; 15 pgs peering; 1 pgs recovering; 1097 pgs stuck degraded; 16 pgs stuck inactive; 26148 pgs stuck unclean; 1096 pgs stuck undersized; 1096 pgs undersized; 4 requests are blocked > 32 sec; recovery 101465/6016350 objects degraded (1.686%); 1691712/6016350 objects misplaced (28.119%)
     monmap e2: 3 mons at {mon1-r2-ser=172.19.14.130:6789/0,mon1-r3-ser=172.19.14.150:6789/0,mon1-rc3-fib=172.19.14.170:6789/0}, election epoch 82, quorum 0,1,2 mon1-r2-ser,mon1-r3-ser,mon1-rc3-fib
     osdmap e15358: 144 osds: 143 up, 143 in
      pgmap v12209990: 38816 pgs, 16 pools, 8472 GB data, 1958 kobjects
            25821 GB used, 234 TB / 259 TB avail
            101465/6016350 objects degraded (1.686%); 1691712/6016350 objects misplaced (28.119%)
                 620 active
               12668 active+clean
                  15 peering
                 395 active+undersized+degraded+remapped
                   1 active+recovering+degraded
               24416 active+remapped
                   1 undersized+degraded
                 700 active+undersized+degraded
  client io 0 B/s rd, 40557 B/s wr, 13 op/s
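
In case it is useful, the commands below are what I am using to inspect the stuck PGs and the blocked requests (just the standard tooling, output omitted for brevity):

# show which PGs and OSDs the warnings refer to
ceph health detail
# dump the PGs stuck unclean / inactive together with their up and acting sets
ceph pg dump_stuck unclean
ceph pg dump_stuck inactive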

Yesterday it was just in a warning state, with some PGs stuck unclean and some requests blocked. When I restarted one of the OSDs involved, a recovery process started, some OSDs went down and then came back up, and some others were moved out of their original bucket:

# id    weight  type name       up/down reweight
-1      262.1   root default
-15     80.08           datacenter fibonacci
-16     80.08                   rack rack-c03-fib
............
-35     83.72           datacenter ingegneria
-31     0                       rack rack-01-ing
-32     0                       rack rack-02-ing
-33     0                       rack rack-03-ing
-34     0                       rack rack-04-ing
-18     83.72                   rack rack-03-ser
-13     20.02                           host-high-end cnode1-r3-ser
124     1.82                                    osd.124 up      1
126     1.82                                    osd.126 up      1
128     1.82                                    osd.128 up      1
133     1.82                                    osd.133 up      1
135     1.82                                    osd.135 up      1
…………
145     1.82                                    osd.145 up      1
146     1.82                                    osd.146 up      1
147     1.82                                    osd.147 up      1
148     1.82                                    osd.148 up      1
5       1.82            osd.5   up      1
150     1.82            osd.150 up      1
153     1.82            osd.153 up      1
80      1.82            osd.80  up      1
24      1.82            osd.24  up      1
131     1.82            osd.131 up      1
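
For what it is worth, my guess is that the restarted OSDs re-registered themselves at a default CRUSH location when they came back up. I have not verified that this is what happened, but if it is, the ceph.conf snippet below is what I would try so that OSDs stop relocating themselves at daemon start (please correct me if this option does not behave this way on 0.87):

[osd]
# do not let the OSD update its own position in the CRUSH map when the daemon starts;
# it then stays wherever it was placed by hand
osd crush update on start = false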

Now, if I put the OSDs back into their buckets by hand (see the example below) it works, but I still have some concerns: why has the recovery process stopped? The cluster is almost empty, so there should be enough space to recover the data even without those 6 OSDs. Has anyone run into this before?
Any advice on what to look for?
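
For reference, this is the kind of command I mean by putting an OSD back by hand (the target bucket is only an example here, each OSD of course goes back to its own host):

# re-place osd.5 with weight 1.82 under a host bucket; "host-high-end" is the
# custom bucket type we use in our CRUSH map
ceph osd crush set osd.5 1.82 host-high-end=cnode1-r3-ser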
Any help is appreciated.

Regards
Simone



--
Simone Spinelli <simone.spine...@unipi.it>
Università di Pisa
Settore Rete, Telecomunicazioni e Fonia - Serra
Direzione Edilizia e Telecomunicazioni

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
