The same thing happened to my setup with CentOS 7.x + a non-stock kernel
(kernel-ml from elrepo).
I was not happy with the IOPS I got out of the stock CentOS 7.x kernel, so
I did the kernel upgrade, and crashes started to happen until some of the
OSDs would no longer boot at all. The funny thing is that I was not able
to downgrade back to the stock kernel, since the OSDs then crashed with
'cannot decode' errors. I am taking backups at the moment, and OSDs still
crash from time to time due to the ceph watchdog, despite the timeouts
being raised to 20x (the kind of settings I mean are sketched below).
I believe the kernel-ml version I started with was 3.19.
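(A minimal sketch of what "20x timeouts" could look like in ceph.conf,
assuming the "watchdog" in question is the OSDs' internal heartbeat /
suicide timeouts; the values are purely illustrative, roughly 20x the
hammer-era defaults, not a recommendation:)

    [osd]
    # Internal heartbeat ("watchdog") timeouts, raised ~20x so that slow
    # requests do not make the OSD abort itself. Illustrative values only.
    osd_op_thread_timeout = 300                  # default 15
    osd_op_thread_suicide_timeout = 3000         # default 150
    filestore_op_thread_timeout = 1200           # default 60
    filestore_op_thread_suicide_timeout = 3600   # default 180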
On Tue, Dec 8, 2015 at 10:34 AM, Tom Christensen <pav...@gmail.com>
wrote:
We didn't go forward to 4.2 as it's a large production cluster, and we
just needed the problem fixed. We'll probably test out 4.2 in the next
couple of months, but this one slipped past us because it didn't occur in
our test cluster until after we had upgraded production. In our
experience it takes about two weeks to start happening, but once it does
it's all hands on deck, because nodes are going to go down regularly.
All that being said, if/when we try 4.2, it's going to need to run rock
solid for 1-2 months in our test cluster before it gets to production.
On Tue, Dec 8, 2015 at 2:30 AM, Benedikt Fraunhofer
<fraunho...@traced.net> wrote:
Hi Tom,
> We have been seeing this same behavior on a cluster that has been
> perfectly happy until we upgraded to the ubuntu vivid 3.19 kernel.
> We are in the
I can't recall when we gave 3.19 a shot, but now that you say it... the
cluster was happy for >9 months with 3.16.
Did you try 4.2, or do you think the regression introduced somewhere
between 3.16 and 3.19 is still present in 4.2?
Thx!
Benedikt