Hello fellow Ceph users,

I ran into a major issue where two KVM hosts will not start due to problems
with my Ceph cluster.

Here are some details:

Running Ceph version 0.87 (Giant). There are 10 hosts with 6 drives each, for 60 OSDs total.

# ceph -s
    cluster 1431e336-faa2-4b13-b50d-c1d375b4e64b
     health HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck 
unclean; 71 requests are blocked > 32 sec; pool rbd-b has too few pgs
     monmap e1: 3 mons at {xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx}, 
election epoch 92, quorum 0,1,2 ceph-b01,ceph-b02,ceph-b03
     mdsmap e49: 1/1/1 up {0=pmceph-b06=up:active}, 1 up:standby
     osdmap e10023: 60 osds: 60 up, 60 in
      pgmap v19851672: 45056 pgs, 22 pools, 13318 GB data, 3922 kobjects
            39863 GB used, 178 TB / 217 TB avail
               45049 active+clean
                   7 incomplete
  client io 954 kB/s rd, 386 kB/s wr, 78 op/s

# ceph health detail
HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck unclean; 69 
requests are blocked > 32 sec; 5 osds have slow requests; pool rbd-b has too 
few pgs
pg 3.38b is stuck inactive since forever, current state incomplete, last acting 
[48,35,2]
pg 1.541 is stuck inactive since forever, current state incomplete, last acting 
[48,20,2]
pg 3.57d is stuck inactive for 15676.967208, current state incomplete, last 
acting [55,48,2]
pg 3.5c9 is stuck inactive since forever, current state incomplete, last acting 
[48,2,15]
pg 3.540 is stuck inactive for 15676.959093, current state incomplete, last 
acting [57,48,2]
pg 3.5a5 is stuck inactive since forever, current state incomplete, last acting 
[2,48,57]
pg 3.305 is stuck inactive for 15676.855987, current state incomplete, last 
acting [39,2,48]
pg 3.38b is stuck unclean since forever, current state incomplete, last acting 
[48,35,2]
pg 1.541 is stuck unclean since forever, current state incomplete, last acting 
[48,20,2]
pg 3.57d is stuck unclean for 15676.971318, current state incomplete, last 
acting [55,48,2]
pg 3.5c9 is stuck unclean since forever, current state incomplete, last acting 
[48,2,15]
pg 3.540 is stuck unclean for 15676.963204, current state incomplete, last 
acting [57,48,2]
pg 3.5a5 is stuck unclean since forever, current state incomplete, last acting 
[2,48,57]
pg 3.305 is stuck unclean for 15676.860098, current state incomplete, last 
acting [39,2,48]
pg 3.5c9 is incomplete, acting [48,2,15] (reducing pool rbd-b min_size from 2 
may help; search ceph.com/docs for 'incomplete')
pg 3.5a5 is incomplete, acting [2,48,57] (reducing pool rbd-b min_size from 2 
may help; search ceph.com/docs for 'incomplete')
pg 3.57d is incomplete, acting [55,48,2] (reducing pool rbd-b min_size from 2 
may help; search ceph.com/docs for 'incomplete')
pg 3.540 is incomplete, acting [57,48,2] (reducing pool rbd-b min_size from 2 
may help; search ceph.com/docs for 'incomplete')
pg 1.541 is incomplete, acting [48,20,2] (reducing pool metadata min_size from 
2 may help; search ceph.com/docs for 'incomplete')
pg 3.38b is incomplete, acting [48,35,2] (reducing pool rbd-b min_size from 2 
may help; search ceph.com/docs for 'incomplete')
pg 3.305 is incomplete, acting [39,2,48] (reducing pool rbd-b min_size from 2 
may help; search ceph.com/docs for 'incomplete')
20 ops are blocked > 2097.15 sec
49 ops are blocked > 1048.58 sec
13 ops are blocked > 2097.15 sec on osd.2
7 ops are blocked > 2097.15 sec on osd.39
3 ops are blocked > 1048.58 sec on osd.39
41 ops are blocked > 1048.58 sec on osd.48
4 ops are blocked > 1048.58 sec on osd.55
1 ops are blocked > 1048.58 sec on osd.57
5 osds have slow requests
pool rbd-b objects per pg (1084) is more than 12.1798 times cluster average (89)
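On the separate "too few pgs" warning: as I understand it, this fires when a pool's objects-per-PG count exceeds the cluster average by more than the `mon_pg_warn_max_object_skew` threshold (default 10), so it is unrelated to the incomplete PGs. The reported ratio for rbd-b checks out:

```shell
# Sanity-check the skew ratio reported above: rbd-b's objects per PG
# (1084) divided by the cluster average (89).
awk 'BEGIN { printf "%.4f\n", 1084 / 89 }'
# prints 12.1798, matching the health output
```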

I ran the following, but it did not help:

# ceph health detail | grep ^pg | cut -c4-9 | while read i; do ceph pg repair ${i}; done
instructing pg 3.38b on osd.48 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.305 on osd.39 to repair
instructing pg 3.38b on osd.48 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.305 on osd.39 to repair
instructing pg 3.5c9 on osd.48 to repair
instructing pg 3.5a5 on osd.2 to repair
instructing pg 3.57d on osd.55 to repair
instructing pg 3.540 on osd.57 to repair
instructing pg 1.541 on osd.48 to repair
instructing pg 3.38b on osd.48 to repair
instructing pg 3.305 on osd.39 to repair
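(My understanding is that `ceph pg repair` only addresses inconsistent PGs, i.e. scrub errors, not incomplete ones, which would explain why it had no effect. I'm considering querying the stuck PGs directly instead; something like this should work, if I understand `ceph pg query` correctly, and it pulls the PG IDs out by field rather than by fixed character columns:)

```shell
# Extract the PG IDs from the "pg <id> ..." lines of the health output,
# de-duplicate them, and query each one.  The "recovery_state" section
# of the JSON should say why the PG is incomplete (e.g. which down OSDs
# it still wants to probe).
for pgid in $(ceph health detail | awk '/^pg / { print $2 }' | sort -u); do
    ceph pg "$pgid" query > "pg-${pgid}-query.json"
done
```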

Also, if I run the following command, it just hangs until I interrupt it with
Ctrl-C:

# rbd -p rbd-b info vm-50193-disk-1
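(Since osd.48 appears in the acting set of every stuck PG and has the most blocked ops, I'm guessing the hang is a request stuck on it. If I understand the OSD admin socket interface correctly, something like this on osd.48's host should show what its slow requests are waiting on; the socket path assumes the default cluster name and run directory:)

```shell
# Query the admin socket of osd.48 (the OSD with the most blocked ops in
# the health output) to dump its in-flight operations.
osd=48
sock="/var/run/ceph/ceph-osd.${osd}.asok"
ceph --admin-daemon "$sock" dump_ops_in_flight
```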


Any help would be greatly appreciated!

Glen Aidukas
Manager IT Infrastructure
t: 610.813.2815


BehaviorMatrix, LLC | 676 Dekalb Pike, Suite 200, Blue Bell, PA, 19422
www.behaviormatrix.com

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
