Hi Glen,

Run a "ceph pg {id} query" on one of your stuck PGs to find out what the PG is waiting on before it can complete.
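The query output is verbose JSON; a minimal sketch of what to look for, using a hypothetical sample shaped like the output of that era (the sample values are made up; the keys "down_osds_we_would_probe" and "peering_blocked_by" in the recovery_state section are the ones that usually explain an incomplete PG):

```shell
# Hypothetical excerpt shaped like a `ceph pg <id> query` result; on the
# live cluster the real command would be, e.g.:
#   ceph pg 3.38b query > pg-3.38b.json
sample='{
  "state": "incomplete",
  "recovery_state": [
    { "name": "Started/Primary/Peering",
      "down_osds_we_would_probe": [ 12 ],
      "peering_blocked_by": [] }
  ]
}'

# These keys in recovery_state usually explain why peering cannot finish:
echo "$sample" | grep -E 'down_osds_we_would_probe|peering_blocked_by'
```

If "down_osds_we_would_probe" lists an OSD id, the PG is waiting to probe that OSD before peering can complete.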
Rgds,
JC

On Friday, January 23, 2015, Glen Aidukas <gaidu...@behaviormatrix.com> wrote:

> Hello fellow ceph users,
>
> I ran into a major issue where two KVM hosts will not start due to issues
> with my Ceph cluster.
>
> Here are some details:
>
> Running ceph version 0.87. There are 10 hosts with 6 drives each for 60 OSDs.
>
> # ceph -s
>     cluster 1431e336-faa2-4b13-b50d-c1d375b4e64b
>      health HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck unclean; 71 requests are blocked > 32 sec; pool rbd-b has too few pgs
>      monmap e1: 3 mons at {xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx}, election epoch 92, quorum 0,1,2 ceph-b01,ceph-b02,ceph-b03
>      mdsmap e49: 1/1/1 up {0=pmceph-b06=up:active}, 1 up:standby
>      osdmap e10023: 60 osds: 60 up, 60 in
>       pgmap v19851672: 45056 pgs, 22 pools, 13318 GB data, 3922 kobjects
>             39863 GB used, 178 TB / 217 TB avail
>                45049 active+clean
>                    7 incomplete
>   client io 954 kB/s rd, 386 kB/s wr, 78 op/s
>
> # ceph health detail
> HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck unclean; 69 requests are blocked > 32 sec; 5 osds have slow requests; pool rbd-b has too few pgs
> pg 3.38b is stuck inactive since forever, current state incomplete, last acting [48,35,2]
> pg 1.541 is stuck inactive since forever, current state incomplete, last acting [48,20,2]
> pg 3.57d is stuck inactive for 15676.967208, current state incomplete, last acting [55,48,2]
> pg 3.5c9 is stuck inactive since forever, current state incomplete, last acting [48,2,15]
> pg 3.540 is stuck inactive for 15676.959093, current state incomplete, last acting [57,48,2]
> pg 3.5a5 is stuck inactive since forever, current state incomplete, last acting [2,48,57]
> pg 3.305 is stuck inactive for 15676.855987, current state incomplete, last acting [39,2,48]
> pg 3.38b is stuck unclean since forever, current state incomplete, last acting [48,35,2]
> pg 1.541 is stuck unclean since forever, current state incomplete, last acting [48,20,2]
> pg 3.57d is stuck unclean for 15676.971318, current state incomplete, last acting [55,48,2]
> pg 3.5c9 is stuck unclean since forever, current state incomplete, last acting [48,2,15]
> pg 3.540 is stuck unclean for 15676.963204, current state incomplete, last acting [57,48,2]
> pg 3.5a5 is stuck unclean since forever, current state incomplete, last acting [2,48,57]
> pg 3.305 is stuck unclean for 15676.860098, current state incomplete, last acting [39,2,48]
> pg 3.5c9 is incomplete, acting [48,2,15] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
> pg 3.5a5 is incomplete, acting [2,48,57] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
> pg 3.57d is incomplete, acting [55,48,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
> pg 3.540 is incomplete, acting [57,48,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
> pg 1.541 is incomplete, acting [48,20,2] (reducing pool metadata min_size from 2 may help; search ceph.com/docs for 'incomplete')
> pg 3.38b is incomplete, acting [48,35,2] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
> pg 3.305 is incomplete, acting [39,2,48] (reducing pool rbd-b min_size from 2 may help; search ceph.com/docs for 'incomplete')
> 20 ops are blocked > 2097.15 sec
> 49 ops are blocked > 1048.58 sec
> 13 ops are blocked > 2097.15 sec on osd.2
> 7 ops are blocked > 2097.15 sec on osd.39
> 3 ops are blocked > 1048.58 sec on osd.39
> 41 ops are blocked > 1048.58 sec on osd.48
> 4 ops are blocked > 1048.58 sec on osd.55
> 1 ops are blocked > 1048.58 sec on osd.57
> 5 osds have slow requests
> pool rbd-b objects per pg (1084) is more than 12.1798 times cluster average (89)
>
> I ran the following but it did not help:
> # ceph health detail | grep ^pg | cut -c4-9 | while read i; do ceph pg repair ${i} ; done
> instructing pg 3.38b on osd.48 to repair
> instructing pg 1.541 on osd.48 to repair
> instructing pg 3.57d on osd.55 to repair
> instructing pg 3.5c9 on osd.48 to repair
> instructing pg 3.540 on osd.57 to repair
> instructing pg 3.5a5 on osd.2 to repair
> instructing pg 3.305 on osd.39 to repair
> instructing pg 3.38b on osd.48 to repair
> instructing pg 1.541 on osd.48 to repair
> instructing pg 3.57d on osd.55 to repair
> instructing pg 3.5c9 on osd.48 to repair
> instructing pg 3.540 on osd.57 to repair
> instructing pg 3.5a5 on osd.2 to repair
> instructing pg 3.305 on osd.39 to repair
> instructing pg 3.5c9 on osd.48 to repair
> instructing pg 3.5a5 on osd.2 to repair
> instructing pg 3.57d on osd.55 to repair
> instructing pg 3.540 on osd.57 to repair
> instructing pg 1.541 on osd.48 to repair
> instructing pg 3.38b on osd.48 to repair
> instructing pg 3.305 on osd.39 to repair
>
> Also, if I run the following cmd, it seems to just hang:
>
> rbd -p rbd-b info vm-50193-disk-1   <-- hangs until I do CTRL-C...
>
> Any help would be greatly appreciated!
>
> Glen Aidukas
> Manager IT Infrastructure
> t: 610.813.2815
>
> BehaviorMatrix, LLC | 676 Dekalb Pike, Suite 200, Blue Bell, PA, 19422
> www.behaviormatrix.com

--
Sent while moving
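Two notes on the loop quoted above: "ceph pg repair" targets scrub inconsistencies, so it would not be expected to clear an incomplete (peering) state, and "cut -c4-9" assumes every pg id occupies the same fixed columns. A sketch of a more robust id extraction, tested against sample lines shaped like the health detail output above:

```shell
# Sample lines shaped like the `ceph health detail` output quoted above.
sample='pg 3.38b is stuck inactive since forever, current state incomplete, last acting [48,35,2]
pg 1.541 is stuck inactive since forever, current state incomplete, last acting [48,20,2]
pg 3.38b is stuck unclean since forever, current state incomplete, last acting [48,35,2]'

# Field 2 is the pg id regardless of its width; sort -u drops the
# duplicate ids from the inactive/unclean double listing.
echo "$sample" | awk '/^pg / {print $2}' | sort -u

# Each id could then be fed to `ceph pg <id> query` instead of repair.
# If query shows peering blocked only by min_size, a reversible step is:
#   ceph osd pool set rbd-b min_size 1    # restore to 2 once recovered
```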
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com