Yeah, the log's not super helpful, but that and your description give
us something to talk about. Thanks!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Apr 8, 2014 at 8:20 PM, Craig Lewis <cle...@centraldesktop.com> wrote:
>
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
>
> On 4/8/14 18:27 , Gregory Farnum wrote:
>
> On Tue, Apr 8, 2014 at 4:57 PM, Craig Lewis <cle...@centraldesktop.com>
> wrote:
>
> pg query says the recovery state is:
>           "might_have_unfound": [
>                 { "osd": 11,
>                   "status": "querying"},
>                 { "osd": 13,
>                   "status": "already probed"}],
>
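For anyone checking the same thing, here is a minimal Python sketch that picks out the OSDs a PG is still waiting on. It assumes only the "might_have_unfound" fragment quoted above; the rest of the pg query output is not reproduced, and the helper name unprobed_osds is hypothetical.

```python
import json

# Sample shaped like the "might_have_unfound" fragment quoted above.
# The surrounding pg query structure is deliberately left out.
sample = """
{"might_have_unfound": [
    {"osd": 11, "status": "querying"},
    {"osd": 13, "status": "already probed"}]}
"""


def unprobed_osds(fragment):
    """Return OSD ids whose status is anything other than 'already probed'."""
    return [
        entry["osd"]
        for entry in fragment.get("might_have_unfound", [])
        if entry.get("status") != "already probed"
    ]


print(unprobed_osds(json.loads(sample)))
```

With the fragment above, only osd.11 is reported, which matches the OSD that was stuck in "querying".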
> I figured out why it wasn't probing osd.11.
>
> When I manually replaced the disk, I added the OSD to the cluster with a
> CRUSH weight of 0.
>
> As soon as I fixed the CRUSH weight, some PGs were allocated to the
> OSD, and the probing completed.  My PG that was stuck in recovery mode for
> 24h has been remapped to be on osd.11.  I believe this will allow the
> recovery to complete.
>
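A rough sketch of how one might spot an OSD that was added with a CRUSH weight of 0, as described above. It assumes a JSON dump of the OSD tree with a "nodes" list whose entries carry "name", "type" and "crush_weight" fields; those field names, the sample values and the helper zero_weight_osds are assumptions for illustration, so check them against what your Ceph version actually prints.

```python
# Hypothetical sample data; the weights below are illustrative, not taken
# from the cluster discussed in this thread.
sample_tree = {
    "nodes": [
        {"id": -1, "name": "default", "type": "root"},
        {"id": 11, "name": "osd.11", "type": "osd", "crush_weight": 0.0},
        {"id": 13, "name": "osd.13", "type": "osd", "crush_weight": 3.64},
    ]
}


def zero_weight_osds(tree):
    """Return names of OSD nodes whose CRUSH weight is exactly zero."""
    return [
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("crush_weight") == 0
    ]


print(zero_weight_osds(sample_tree))
```

An OSD flagged this way receives no PGs from CRUSH until its weight is raised, which is consistent with the behavior reported in this message.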
> Glad you worked it out. That sounds odd to me, though. Do you have any
> logs from osd.11?
> -Greg
>
>
> Sure, but I don't think they'll be very helpful.  I only had the default
> logging levels.  Here are the logs from today:
> https://cd.centraldesktop.com/p/eAAAAAAADQ70AAAAAEBvDJY
>
>
> At 2014-04-08 16:15, I restarted the OSD.  That was to force the stalled
> recovery to yield to another recovery/backfill.  It seems to get hung up
> every so often.  Whenever I only saw this one PG in recovery state for more
> than 15 minutes, I'd restart osd.11, and it would recover/backfill other PGs
> for another ~12 hours.  It's probably not helping that I have max backfills
> set to 1.
>
>
> I didn't record the exact time, but I ran a few of these, trying to zero in
> on the right weight for the device.  The final command was:
> ceph osd crush reweight osd.11 3.64
> around 17:00 PDT (the timezone used in the logs).  The logs show a scrub
> starting at 2014-04-08 16:50:40.682409, so I'd say it was just before that.
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
