[ceph-users] Manually mucked up pg, need help fixing

Jeff Bachtel Sat, 03 May 2014 09:17:19 -0700

This is all on firefly rc1 on CentOS 6

I had an osd getting overfull and, misinterpreting directions, I downed it and then manually removed pg directories from the osd mount. On restart and after a good deal of rebalancing (setting osd weights as I should've originally), I'm now at:


    cluster de10594a-0737-4f34-a926-58dc9254f95f
     health HEALTH_WARN 2 pgs backfill; 1 pgs incomplete; 1 pgs stuck inactive; 308 pgs stuck unclean; recovery 1/2420563 objects degraded (0.000%); noout flag(s) set
     monmap e7: 3 mons at {controller1=10.100.2.1:6789/0,controller2=10.100.2.2:6789/0,controller3=10.100.2.3:6789/0}, election epoch 556, quorum 0,1,2 controller1,controller2,controller3
     mdsmap e268: 1/1/1 up {0=controller1=up:active}
     osdmap e3492: 5 osds: 5 up, 5 in
            flags noout
      pgmap v4167420: 320 pgs, 15 pools, 4811 GB data, 1181 kobjects
            9770 GB used, 5884 GB / 15654 GB avail
            1/2420563 objects degraded (0.000%)
                   3 active
                  12 active+clean
                   2 active+remapped+wait_backfill
                   1 incomplete
                 302 active+remapped
  client io 364 B/s wr, 0 op/s

# ceph pg dump | grep 0.2f
dumped all in format plain

0.2f 0 0 0 0 0 0 0 incomplete 2014-05-03 11:38:01.526832 0'0 3492:23 [4] 4 [4] 4 2254'20053 2014-04-28 00:24:36.504086 2100'18109 2014-04-26 22:26:23.699330


# ceph pg map 0.2f
osdmap e3492 pg 0.2f (0.2f) -> up [4] acting [4]
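
For anyone scripting around this kind of check, here is a minimal Python sketch that pulls the up and acting sets out of a `ceph pg map` line. The regular expression is modeled only on the single output line shown above, so treat the pattern as an assumption; other Ceph releases may print a different layout. The names `PG_MAP_RE` and `parse_pg_map` are made up for illustration.

```python
import re

# Pattern modeled on the one `ceph pg map` line quoted in this post.
# Assumption: other Ceph releases may format this line differently.
PG_MAP_RE = re.compile(
    r"osdmap e(?P<epoch>\d+) pg (?P<pgid>\S+) \(\S+\) -> "
    r"up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
)


def parse_pg_map(line):
    """Return the epoch, pgid, and up/acting OSD id lists from one line."""
    match = PG_MAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognized pg map line: %r" % line)

    def ids(text):
        # "4,1" -> [4, 1]; an empty set "[]" -> []
        return [int(part) for part in text.split(",") if part]

    return {
        "epoch": int(match.group("epoch")),
        "pgid": match.group("pgid"),
        "up": ids(match.group("up")),
        "acting": ids(match.group("acting")),
    }


info = parse_pg_map("osdmap e3492 pg 0.2f (0.2f) -> up [4] acting [4]")
print(info)
```

With the line from this cluster, both sets come back as `[4]`, which is the problem in a nutshell: the only OSD the map points to is the one whose copy was removed.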

The pg query for the downed pg is at https://gist.github.com/jeffb-bt/c8730899ff002070b325

Of course, the osd I manually mucked with is the only one the cluster is picking up as up/acting. Now, I can query the pg and find epochs where other osds (that I didn't jack up) were acting. And in fact, the latest of those entries (osd.1) has the pg directory in its osd mount, and it's a good healthy 59gb.
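
The lookup described above (scanning the pg query for earlier acting sets) can be sketched in Python. This is an illustration under stated assumptions, not a description of what the gist contains: it assumes the JSON from `ceph pg 0.2f query` has a `recovery_state` list whose entries may carry a `past_intervals` list with `first`, `last`, and `acting` fields. The sample dictionary is invented for the demo and is not the real query output, and the function name is hypothetical.

```python
def osds_from_past_intervals(query, exclude=()):
    """Collect OSD ids from past acting sets, newest interval first.

    Assumption: `query` is the parsed JSON of a pg query and stores its
    history under recovery_state[*].past_intervals[*].acting.
    """
    seen = []
    for state in query.get("recovery_state", []):
        intervals = state.get("past_intervals", [])
        # Newest interval first, judged by its closing epoch.
        ordered = sorted(intervals, key=lambda i: int(i["last"]), reverse=True)
        for interval in ordered:
            for osd in interval.get("acting", []):
                osd = int(osd)
                if osd not in exclude and osd not in seen:
                    seen.append(osd)
    return seen


# Invented sample data, shaped like the assumed layout (NOT the real output).
sample = {
    "state": "incomplete",
    "recovery_state": [
        {
            "name": "Started/Primary/Peering",
            "past_intervals": [
                {"first": 100, "last": 200, "acting": [2, 4]},
                {"first": 201, "last": 300, "acting": [1]},
            ],
        }
    ],
}

# Skip osd.4, the one whose pg directories were removed by hand.
candidates = osds_from_past_intervals(sample, exclude={4})
print(candidates)
```

On a live cluster the dictionary would come from parsing the query output with `json.loads`; the field names should be checked against the actual JSON before relying on the result.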

I've tried manually rsync'ing (and preserving attributes) that set of directories from osd.1 to osd.4 without success. Likewise I've tried copying the directories over without attributes set. I've done many, many deep scrubs but the pg query does not show the scrub timestamps being affected.

I'm seeking ideas for either fixing metadata on the directory on osd.4 to cause this pg to be seen/recognized, or ideas on forcing the cluster's pg map to point to osd.1 for the incomplete pg (basically wiping out the cluster's memory that osd.4 ever had 0.2f). Or any other solution :) It's only 59g, so worst case I'll mark it lost and recreate the pg, but I'd prefer to learn enough of the innards to understand what is going on, and possible means of fixing it.


Thanks for any help,

Jeff

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
