Have you tuned any of the recovery or backfill parameters?  My ceph.conf
has:
[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1
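
If it helps, these can also be changed at runtime without restarting the
OSDs.  Something along these lines should work (run from a node with an
admin keyring; injected values don't survive an OSD restart, so keep them
in ceph.conf as well):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'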

Still, if it runs for a few hours and then fails, it sounds like there
might be something else at play.  OSDs use a lot of RAM during recovery.
How much RAM and how many OSDs do you have in these nodes?  What does
memory usage look like after a fresh restart, and what does it look like
when the problems start?  Even better if you know what it looks like 5
minutes before the problems start.
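
If you don't have monitoring on those hosts, a crude sampling loop is
usually enough to see the trend (just a sketch -- adjust the log path and
interval to taste):

  # sample OSD memory use once a minute
  while true; do
      date >> /var/log/osd-mem.log
      free -m >> /var/log/osd-mem.log
      ps -C ceph-osd -o pid,rss,vsz,args >> /var/log/osd-mem.log
      sleep 60
  done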

Is there anything interesting in the kernel logs?  OOM killer activity, or
memory deadlocks?
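
For example, something like this should surface any OOM kills (the exact
log file name depends on your distro):

  dmesg | grep -iE 'out of memory|oom.killer|killed process'
  grep -iE 'out of memory|oom' /var/log/kern.log /var/log/messages 2>/dev/null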



On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg <e...@logtenberg.eu> wrote:

> Hi,
>
> I have some OSDs that keep committing suicide. My cluster has ~1.3M
> misplaced objects, and it can't really recover, because OSDs keep
> failing before recovering finishes. The load on the hosts is quite high,
> but the cluster currently has no other tasks than just the
> backfilling/recovering.
>
> I attached the logfile from a failed OSD. It shows the suicide, the
> recent events and also me starting the OSD again after some time.
>
> It'll keep running for a couple of hours and then fail again, for the
> same reason.
>
> I noticed a lot of timeouts. Apparently Ceph stresses the hosts to the
> limit with the recovery tasks, so much that they time out and can't
> finish the task. I don't understand why. Can I somehow throttle Ceph a
> bit so that it doesn't keep overrunning itself? I kinda feel like it
> should chill out a bit and simply recover one step at a time instead of
> going full force and then failing.
>
> Thanks,
>
> Erik.
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
