I can confirm it seems to be kernels greater than 3.16. We had this problem where servers would lock up, and we had to perform restarts on a weekly basis. Since downgrading to 3.16 we have not had to do any restarts.
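If anyone wants to do the same downgrade, a minimal sketch of pinning the kernel so it doesn't get upgraded again (this assumes an apt-based distro, and the package name below is only an example; pick whichever 3.16 image your distro actually ships):

    # install the 3.16 kernel image and hold it so upgrades don't pull in a newer one
    apt-get install linux-image-3.16.0-4-amd64   # example package name, verify locally
    apt-mark hold linux-image-3.16.0-4-amd64
    # after rebooting into it, confirm the running kernel
    uname -r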
I did find this thread in the XFS forums and I am not sure if it has been fixed or not:
http://oss.sgi.com/archives/xfs/2015-07/msg00034.html

On Tue, Dec 8, 2015 at 2:06 AM Tom Christensen <pav...@gmail.com> wrote:

> We run deep scrubs via cron with a script so we know when deep scrubs are
> happening, and we've seen nodes fail both during deep scrubbing and while
> no deep scrubs are occurring, so I'm pretty sure it's not related.
>
> On Tue, Dec 8, 2015 at 2:42 AM, Benedikt Fraunhofer <fraunho...@traced.net> wrote:
>
>> Hi Tom,
>>
>> 2015-12-08 10:34 GMT+01:00 Tom Christensen <pav...@gmail.com>:
>>
>> > We didn't go forward to 4.2 as it's a large production cluster, and we just
>> > needed the problem fixed. We'll probably test out 4.2 in the next couple
>>
>> unfortunately we don't have the luxury of a test cluster.
>> And to add to that, we couldn't simulate the load, although it does not
>> seem to be load related.
>> Did you try running with nodeep-scrub as a short-term workaround?
>>
>> I'll give ~30% of the nodes 4.2 and see how it goes.
>>
>> > In our experience it takes about 2 weeks to start happening
>>
>> we're well below that. Somewhere between 1 and 4 days.
>> And yes, once one goes south, it affects the rest of the cluster.
>>
>> Thx!
>>
>> Benedikt
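For anyone who wants to try the nodeep-scrub idea Benedikt mentions above, the flag is set and cleared with the standard ceph CLI. A minimal sketch (the loop at the end is just an illustration of triggering deep scrubs at a known time, not Tom's actual cron script):

    # stop new deep scrubs cluster-wide while testing
    ceph osd set nodeep-scrub
    ceph -s                       # the flag shows up in the cluster status
    # re-enable once you're done
    ceph osd unset nodeep-scrub

    # example of kicking off deep scrubs yourself (e.g. from cron) so the timing is known
    for osd in $(ceph osd ls); do
        ceph osd deep-scrub "$osd"
    done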