Have you tuned any of the recovery or backfill parameters? My ceph.conf has:

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
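If you want to try those values without restarting the OSDs, something along
these lines should work (just a sketch -- double-check the exact option names
against your Ceph version):

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'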
Still, if it's running for a few hours and then failing, it sounds like
there might be something else at play. OSDs use a lot of RAM during
recovery. How much RAM and how many OSDs do you have in these nodes?
What does memory usage look like after a fresh restart, and what does it
look like when the problems start? Even better if you know what it looks
like 5 minutes before the problems start. Is there anything interesting
in the kernel logs? OOM killers, or memory deadlocks? (A few example
commands for checking this are sketched below the quote.)

On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg <e...@logtenberg.eu> wrote:
> Hi,
>
> I have some OSDs that keep committing suicide. My cluster has ~1.3M
> misplaced objects, and it can't really recover, because OSDs keep
> failing before recovery finishes. The load on the hosts is quite high,
> but the cluster currently has no other tasks than just the
> backfilling/recovering.
>
> I attached the logfile from a failed OSD. It shows the suicide, the
> recent events and also me starting the OSD again after some time.
>
> It'll keep running for a couple of hours and then fail again, for the
> same reason.
>
> I noticed a lot of timeouts. Apparently ceph stresses the hosts to the
> limit with the recovery tasks, so much that they time out and can't
> finish the task. I don't understand why. Can I somehow throttle ceph a
> bit so that it doesn't keep overrunning itself? I kinda feel like it
> should chill out a bit and simply recover one step at a time instead of
> full force and then fail.
>
> Thanks,
>
> Erik.
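For the memory and OOM questions above, a rough sketch of what I'd run
(adjust the log path for your distro):

    # overall memory and per-OSD resident size
    free -m
    ps aux --sort=-rss | grep ceph-osd | head

    # look for the kernel OOM killer
    dmesg | grep -i -E 'oom|out of memory'
    grep -i -E 'oom|out of memory' /var/log/messages   # or /var/log/kern.log on Debian/Ubuntu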