Hi Glen

Run a 'ceph pg {id} query' on one of your stuck PGs to find out what the PG
is waiting on before it can complete.
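
For example, with the first stuck PG from your health detail output:

# ceph pg 3.38b query

The output is JSON; the recovery_state section near the end (fields such as
"down_osds_we_would_probe" and "peering_blocked_by") usually tells you which
OSDs the PG still needs to hear from before it can peer.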

Rgds
JC


On Friday, January 23, 2015, Glen Aidukas <gaidu...@behaviormatrix.com>
wrote:

>  Hello fellow ceph users,
>
>
>
> I ran into a major issue where two KVM hosts will not start due to problems
> with my Ceph cluster.
>
>
>
> Here are some details:
>
>
>
> Running ceph version 0.87.  There are 10 hosts with 6 drives each for 60
> OSDs.
>
>
>
> # ceph -s
>
>     cluster 1431e336-faa2-4b13-b50d-c1d375b4e64b
>
>      health HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs
> stuck unclean; 71 requests are blocked > 32 sec; pool rbd-b has too few pgs
>
>      monmap e1: 3 mons at {xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx},
> election epoch 92, quorum 0,1,2 ceph-b01,ceph-b02,ceph-b03
>
>      mdsmap e49: 1/1/1 up {0=pmceph-b06=up:active}, 1 up:standby
>
>      osdmap e10023: 60 osds: 60 up, 60 in
>
>       pgmap v19851672: 45056 pgs, 22 pools, 13318 GB data, 3922 kobjects
>
>             39863 GB used, 178 TB / 217 TB avail
>
>                45049 active+clean
>
>                    7 incomplete
>
>   client io 954 kB/s rd, 386 kB/s wr, 78 op/s
>
>
>
> # ceph health detail
>
> HEALTH_WARN 7 pgs incomplete; 7 pgs stuck inactive; 7 pgs stuck unclean;
> 69 requests are blocked > 32 sec; 5 osds have slow requests; pool rbd-b has
> too few pgs
>
> pg 3.38b is stuck inactive since forever, current state incomplete, last
> acting [48,35,2]
>
> pg 1.541 is stuck inactive since forever, current state incomplete, last
> acting [48,20,2]
>
> pg 3.57d is stuck inactive for 15676.967208, current state incomplete,
> last acting [55,48,2]
>
> pg 3.5c9 is stuck inactive since forever, current state incomplete, last
> acting [48,2,15]
>
> pg 3.540 is stuck inactive for 15676.959093, current state incomplete,
> last acting [57,48,2]
>
> pg 3.5a5 is stuck inactive since forever, current state incomplete, last
> acting [2,48,57]
>
> pg 3.305 is stuck inactive for 15676.855987, current state incomplete,
> last acting [39,2,48]
>
> pg 3.38b is stuck unclean since forever, current state incomplete, last
> acting [48,35,2]
>
> pg 1.541 is stuck unclean since forever, current state incomplete, last
> acting [48,20,2]
>
> pg 3.57d is stuck unclean for 15676.971318, current state incomplete, last
> acting [55,48,2]
>
> pg 3.5c9 is stuck unclean since forever, current state incomplete, last
> acting [48,2,15]
>
> pg 3.540 is stuck unclean for 15676.963204, current state incomplete, last
> acting [57,48,2]
>
> pg 3.5a5 is stuck unclean since forever, current state incomplete, last
> acting [2,48,57]
>
> pg 3.305 is stuck unclean for 15676.860098, current state incomplete, last
> acting [39,2,48]
>
> pg 3.5c9 is incomplete, acting [48,2,15] (reducing pool rbd-b min_size
> from 2 may help; search ceph.com/docs for 'incomplete')
>
> pg 3.5a5 is incomplete, acting [2,48,57] (reducing pool rbd-b min_size
> from 2 may help; search ceph.com/docs for 'incomplete')
>
> pg 3.57d is incomplete, acting [55,48,2] (reducing pool rbd-b min_size
> from 2 may help; search ceph.com/docs for 'incomplete')
>
> pg 3.540 is incomplete, acting [57,48,2] (reducing pool rbd-b min_size
> from 2 may help; search ceph.com/docs for 'incomplete')
>
> pg 1.541 is incomplete, acting [48,20,2] (reducing pool metadata min_size
> from 2 may help; search ceph.com/docs for 'incomplete')
>
> pg 3.38b is incomplete, acting [48,35,2] (reducing pool rbd-b min_size
> from 2 may help; search ceph.com/docs for 'incomplete')
>
> pg 3.305 is incomplete, acting [39,2,48] (reducing pool rbd-b min_size
> from 2 may help; search ceph.com/docs for 'incomplete')
>
> 20 ops are blocked > 2097.15 sec
>
> 49 ops are blocked > 1048.58 sec
>
> 13 ops are blocked > 2097.15 sec on osd.2
>
> 7 ops are blocked > 2097.15 sec on osd.39
>
> 3 ops are blocked > 1048.58 sec on osd.39
>
> 41 ops are blocked > 1048.58 sec on osd.48
>
> 4 ops are blocked > 1048.58 sec on osd.55
>
> 1 ops are blocked > 1048.58 sec on osd.57
>
> 5 osds have slow requests
>
> pool rbd-b objects per pg (1084) is more than 12.1798 times cluster
> average (89)
>
>
>
> I ran the following, but it did not help:
>
>
>
> # ceph health detail | grep ^pg | cut -c4-9 | while read i; do ceph pg
> repair ${i} ; done
>
> instructing pg 3.38b on osd.48 to repair
>
> instructing pg 1.541 on osd.48 to repair
>
> instructing pg 3.57d on osd.55 to repair
>
> instructing pg 3.5c9 on osd.48 to repair
>
> instructing pg 3.540 on osd.57 to repair
>
> instructing pg 3.5a5 on osd.2 to repair
>
> instructing pg 3.305 on osd.39 to repair
>
> instructing pg 3.38b on osd.48 to repair
>
> instructing pg 1.541 on osd.48 to repair
>
> instructing pg 3.57d on osd.55 to repair
>
> instructing pg 3.5c9 on osd.48 to repair
>
> instructing pg 3.540 on osd.57 to repair
>
> instructing pg 3.5a5 on osd.2 to repair
>
> instructing pg 3.305 on osd.39 to repair
>
> instructing pg 3.5c9 on osd.48 to repair
>
> instructing pg 3.5a5 on osd.2 to repair
>
> instructing pg 3.57d on osd.55 to repair
>
> instructing pg 3.540 on osd.57 to repair
>
> instructing pg 1.541 on osd.48 to repair
>
> instructing pg 3.38b on osd.48 to repair
>
> instructing pg 3.305 on osd.39 to repair
>
>
>
> Also, if I run the following command, it just hangs.
>
>
>
> rbd -p rbd-b info vm-50193-disk-1    <-- hangs until I press CTRL-C…
>
>
>
>
>
> Any help would be greatly appreciated!
>
>
>
> *Glen Aidukas*
>
> *Manager IT Infrastructure*
>
> t: 610.813.2815
>
>
>
> BehaviorMatrix, LLC | 676 Dekalb Pike, Suite 200, Blue Bell, PA, 19422
>
> www.behaviormatrix.com
>
>
>


-- 
Sent while moving
