Well, that'd be the ideal solution. Please check out the GitHub gist I
posted, though. It seems that despite osd.4 having nothing good for pg
0.2f, the cluster does not acknowledge any other osd has a copy of the
pg. I've tried downing osd.4 and manually deleting the pg directory in
question with the hope that the cluster would roll back epochs for 0.2f,
but all it does is recreate the pg directory (empty) on osd.4.
Jeff
On 05/05/2014 04:33 PM, Gregory Farnum wrote:
What's your cluster look like? I wonder if you can just remove the bad
PG from osd.4 and let it recover from the existing osd.1
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
On Sat, May 3, 2014 at 9:17 AM, Jeff Bachtel
<jbach...@bericotechnologies.com> wrote:
This is all on firefly rc1 on CentOS 6
I had an osd getting overfull and, misinterpreting directions, I downed it
and then manually removed pg directories from the osd mount. On restart and
after a good deal of rebalancing (setting osd weights as I should've
originally), I'm now at:
cluster de10594a-0737-4f34-a926-58dc9254f95f
health HEALTH_WARN 2 pgs backfill; 1 pgs incomplete; 1 pgs stuck inactive; 308 pgs stuck unclean; recovery 1/2420563 objects degraded (0.000%); noout flag(s) set
monmap e7: 3 mons at {controller1=10.100.2.1:6789/0,controller2=10.100.2.2:6789/0,controller3=10.100.2.3:6789/0}, election epoch 556, quorum 0,1,2 controller1,controller2,controller3
mdsmap e268: 1/1/1 up {0=controller1=up:active}
osdmap e3492: 5 osds: 5 up, 5 in
flags noout
pgmap v4167420: 320 pgs, 15 pools, 4811 GB data, 1181 kobjects
9770 GB used, 5884 GB / 15654 GB avail
1/2420563 objects degraded (0.000%)
3 active
12 active+clean
2 active+remapped+wait_backfill
1 incomplete
302 active+remapped
client io 364 B/s wr, 0 op/s
# ceph pg dump | grep 0.2f
dumped all in format plain
0.2f 0 0 0 0 0 0 0 incomplete 2014-05-03 11:38:01.526832 0'0 3492:23 [4] 4 [4] 4 2254'20053 2014-04-28 00:24:36.504086 2100'18109 2014-04-26 22:26:23.699330
# ceph pg map 0.2f
osdmap e3492 pg 0.2f (0.2f) -> up [4] acting [4]
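For anyone wanting to script this check, here is a minimal Python sketch that parses a `ceph pg map` line of the shape shown above into its osdmap epoch, pg id, and up/acting sets. The function name `parse_pg_map` is hypothetical, and the handling of comma-separated multi-osd sets is an assumption based on the single-osd output pasted here rather than something taken from the Ceph source.

```python
import re

# Matches lines shaped like the output pasted above:
#   osdmap e3492 pg 0.2f (0.2f) -> up [4] acting [4]
# Comma-separated sets such as [1,2] are an assumed format.
PG_MAP_RE = re.compile(
    r"osdmap e(?P<epoch>\d+) pg (?P<pgid>\S+) \(\S+\) -> "
    r"up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
)


def parse_pg_map(line):
    """Parse one `ceph pg map` output line into a dict (hypothetical helper)."""
    match = PG_MAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognized pg map line: %r" % line)

    def osd_ids(text):
        # Turn "4" or "1,2" into a list of ints; an empty set yields [].
        return [int(part) for part in text.split(",") if part]

    return {
        "epoch": int(match.group("epoch")),
        "pgid": match.group("pgid"),
        "up": osd_ids(match.group("up")),
        "acting": osd_ids(match.group("acting")),
    }


if __name__ == "__main__":
    info = parse_pg_map("osdmap e3492 pg 0.2f (0.2f) -> up [4] acting [4]")
    # A pg whose acting set is a single osd has no other replica serving it.
    if len(info["acting"]) == 1:
        print("pg %s is served only by osd.%d" % (info["pgid"], info["acting"][0]))
```

Run against the line above, this reports that 0.2f is served only by osd.4, which is the situation described here.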
The pg query for the downed pg is at
https://gist.github.com/jeffb-bt/c8730899ff002070b325
Of course, the osd I manually mucked with is the only one the cluster is
picking up as up/acting. Now, I can query the pg and find epochs where other
osds (that I didn't jack up) were acting. And in fact, the latest of those
entries (osd.1) has the pg directory in its osd mount, and it's a good
healthy 59 GB.
I've tried manually rsync'ing (and preserving attributes) that set of
directories from osd.1 to osd.4 without success. Likewise I've tried copying
the directories over without attributes set. I've done many, many deep
scrubs but the pg query does not show the scrub timestamps being affected.
I'm seeking ideas for either fixing metadata on the directory on osd.4 to
cause this pg to be seen/recognized, or ideas on forcing the cluster's pg
map to point to osd.1 for the incomplete pg (basically wiping out the
cluster's memory that osd.4 ever had 0.2f). Or any other solution :) It's
only 59 GB, so worst case I'll mark it lost and recreate the pg, but I'd
prefer to learn enough of the innards to understand what is going on and
what the possible means of fixing it are.
Thanks for any help,
Jeff
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com