Yeah, the log's not super helpful, but that and your description give
us something to talk about. Thanks!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Tue, Apr 8, 2014 at 8:20 PM, Craig Lewis <cle...@centraldesktop.com> wrote:
>
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
>
> On 4/8/14 18:27 , Gregory Farnum wrote:
>
> On Tue, Apr 8, 2014 at 4:57 PM, Craig Lewis <cle...@centraldesktop.com>
> wrote:
>
> pg query says the recovery state is:
>           "might_have_unfound": [
>                 { "osd": 11,
>                   "status": "querying"},
>                 { "osd": 13,
>                   "status": "already probed"}],
>
> I figured out why it wasn't probing osd.11.
>
> When I manually replaced the disk, I added the OSD to the cluster with a
> CRUSH weight of 0.
>
> As soon as I fixed the CRUSH weight, some PGs were allocated to the
> OSD, and the probing completed. My PG that was stuck in recovery mode for
> 24h has been remapped to be on osd.11. I believe this will allow the
> recovery to complete.
>
> Glad you worked it out. That sounds odd to me, though. Do you have any
> logs from osd.11?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> Sure, but I don't think they'll be very helpful. I only had the default
> logging levels. Here are the logs from today:
> https://cd.centraldesktop.com/p/eAAAAAAADQ70AAAAAEBvDJY
>
>
> At 2014-04-08 16:15, I restarted the OSD. That was to force the stalled
> recovery to yield to another recovery/backfill. It seems to get hung up
> every so often. Whenever I saw only this one PG in recovery state for more
> than 15 minutes, I'd restart osd.11, and it would recover/backfill other PGs
> for another ~12 hours. It's probably not helping that I have max backfills
> set to 1.
>
>
> I didn't record the exact time, but I ran a few of these, trying to zero in
> on the right weight for the device. The final command was:
> ceph osd crush reweight osd.11 3.64
> around 17:00 PDT (timezone in the logs). Since the logs show a scrub
> starting at 2014-04-08 16:50:40.682409, I'd say it was just before that.
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
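
For anyone hitting the same symptom, the "might_have_unfound" fragment quoted above can be checked with a few lines of Python rather than by eye. This is only a rough sketch: the sample string mirrors the fragment from this thread, the helper name pending_osds is made up for illustration, and treating every status other than "already probed" as still pending is an assumption, not documented Ceph behavior. Real pg query output nests this list inside a larger structure, so it would need to be extracted first.

```python
import json

# Sample shaped like the "might_have_unfound" fragment quoted in this thread.
# Real `ceph pg <pgid> query` output is larger; this is only the relevant part.
SAMPLE = """
{"might_have_unfound": [
    {"osd": 11, "status": "querying"},
    {"osd": 13, "status": "already probed"}]}
"""


def pending_osds(recovery_state):
    """Return the OSD ids that have not finished being probed.

    Assumption: any status other than "already probed" means the OSD
    still has to answer (for example "querying", as seen for osd.11).
    """
    return [
        entry["osd"]
        for entry in recovery_state.get("might_have_unfound", [])
        if entry.get("status") != "already probed"
    ]


if __name__ == "__main__":
    state = json.loads(SAMPLE)
    for osd_id in pending_osds(state):
        # An OSD that stays in this list for hours is worth a closer look,
        # e.g. checking its CRUSH weight as described in the thread.
        print(f"osd.{osd_id} has not been fully probed yet")
```

An OSD that never leaves that list is the one to inspect; in this thread it was osd.11, which had been added with a CRUSH weight of 0.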