Re: [DISCUSS] Fault injection tests for Kafka

2016-10-05 Thread Gwen Shapira
YES! One of my goals for the fault injection in our system tests is that whoever fixes the issue will also add tests to make sure it stays fixed.

On Wed, Oct 5, 2016 at 11:33 AM, Tom Crayford wrote:
> I did some stuff like this recently with simple calls to `tc` (samples that
> I used were in the

Re: [DISCUSS] Fault injection tests for Kafka

2016-10-05 Thread Tom Crayford
I did some stuff like this recently with simple calls to `tc` (samples that I used were in the README for https://github.com/tylertreat/comcast). The only notable bug I found so far is that if you cut all the Kafka nodes entirely off from ZooKeeper for, say, 60 seconds, then reconnect them, the node
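
For illustration only, here is a minimal sketch of what "simple calls to `tc`" might look like when scripted. It is not the script used in this message: the helper names (`netem_add`, `netem_clear`, `partition_plan`) and the interface name `eth0` are assumptions, and `loss 100%` on the root qdisc drops all outbound traffic on that interface, not just traffic to ZooKeeper. The code only builds and prints the commands; running them requires root on a Linux host with the netem qdisc available.

```python
from typing import List


def netem_add(dev: str, impairment: str) -> List[str]:
    """Build a `tc` command that attaches a netem qdisc to `dev`.

    `impairment` is a netem option string such as "loss 10%" or "delay 100ms".
    """
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem"] + impairment.split()


def netem_clear(dev: str) -> List[str]:
    """Build the `tc` command that removes the root netem qdisc from `dev`."""
    return ["tc", "qdisc", "del", "dev", dev, "root", "netem"]


def partition_plan(dev: str, seconds: int) -> List[List[str]]:
    """Hypothetical plan: drop all egress on `dev`, wait, then restore it."""
    return [
        netem_add(dev, "loss 100%"),
        ["sleep", str(seconds)],
        netem_clear(dev),
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in partition_plan("eth0", 60):
        print(" ".join(cmd))
```

A test harness could pass each command list to `subprocess.run(cmd, check=True)` on the broker hosts. Keeping the plan as data makes it easy to log exactly which fault was injected when a failure is reported.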

Re: [DISCUSS] Fault injection tests for Kafka

2016-10-05 Thread Gwen Shapira
Yeah, totally agree on discussing what we want to test first and implementing anything later :) It's just that whenever I have this discussion Jepsen comes up, so I was curious what was driving the interest and whether the specific framework is important to the community.

On Tue, Oct 4, 2016 at 5:46 P

Re: [DISCUSS] Fault injection tests for Kafka

2016-10-04 Thread radai
For "small" failures (local failures on a single node, like socket disconnection, disk read errors, out of memory, etc.) I've used Byteman before - http://byteman.jboss.org/

On Tue, Oct 4, 2016 at 5:46 PM, Joel Koshy wrote:
> Hi Gwen,
>
> I've also seen suggestions of using Jepsen for fault inject

Re: [DISCUSS] Fault injection tests for Kafka

2016-10-04 Thread Joel Koshy
Hi Gwen,

> I've also seen suggestions of using Jepsen for fault injection, but
> I'm not familiar with this framework.
>
> What do you guys think? Write our own failure injection? or write
> Kafka tests in Jepsen?

This would definitely add a lot of value and save a lot on release validation overh

[DISCUSS] Fault injection tests for Kafka

2016-10-03 Thread Gwen Shapira
Hi Team Kafka, I was thinking of enhancing our system tests with some fault injections. You know, drop random packets, partition some nodes, delete disks, maybe play with system clocks. Fun stuff :) I was thinking of adding the fault injection to our system tests, so if someone reports a failure