YES! One of my goals for fault injection in our system tests is that whoever fixes an issue will also add tests to make sure it stays fixed.
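As a rough illustration only, the ZooKeeper-partition scenario Tom describes below might be turned into a repeatable test along these lines. This is a sketch under several assumptions, not an existing Kafka system test: it swaps in `iptables` rules in place of the `tc` calls Tom used, because a port-based DROP rule is an easy way to cut only ZooKeeper traffic; it assumes ZooKeeper listens on its default client port 2181; it would have to run as root on each broker host; and `check_metadata` is a hypothetical callback standing in for whatever metadata request the test would make after the outage. All function names here are made up for the example.

```python
import subprocess
import time
from contextlib import contextmanager

ZK_PORT = 2181  # ZooKeeper's default client port (assumption: adjust per cluster)


def partition_rule(action, port=ZK_PORT):
    """Build an iptables command that appends ("-A") or deletes ("-D") a rule
    dropping outbound TCP traffic to the given port."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return ["iptables", action, "OUTPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]


@contextmanager
def zookeeper_partition(run=subprocess.check_call, port=ZK_PORT):
    """Cut this host off from ZooKeeper for the duration of the block.

    The rule is removed in `finally`, so the host is reconnected even if
    the body raises.
    """
    run(partition_rule("-A", port))
    try:
        yield
    finally:
        run(partition_rule("-D", port))


def run_scenario(check_metadata, outage_seconds=60,
                 run=subprocess.check_call, sleep=time.sleep):
    """Partition from ZooKeeper, wait, reconnect, then return whatever the
    hypothetical `check_metadata` callback reports."""
    with zookeeper_partition(run=run):
        sleep(outage_seconds)
    return check_metadata()
```

Passing `run` and `sleep` as parameters is a deliberate choice: the same scenario can be dry-run with stand-ins (for example `run=print`) before it is pointed at real brokers.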

On Wed, Oct 5, 2016 at 11:33 AM, Tom Crayford <tcrayf...@heroku.com> wrote:

> I did some stuff like this recently with simple calls to `tc` (samples that
> I used were in the README for https://github.com/tylertreat/comcast). The
> only notable bug I found so far is that if you cut all the kafka nodes
> entirely off from zookeeper for say, 60 seconds, then reconnect them, the
> nodes don't crash, they report as healthy in JMX, but calls to fetch
> metadata from them time out entirely. That can be fixed with a rolling
> restart, but it doesn't sound ideal (especially in the face of cloud
> networks, where short-lived total network outages can and do happen).
> Should I file a Jira detailing that bug?
>
> On Wed, Oct 5, 2016 at 7:26 PM, Gwen Shapira <g...@confluent.io> wrote:
>
>> Yeah, totally agree on discussing what we want to test first and
>> implement anything later :)
>>
>> It's just that whenever I have this discussion Jepsen came up, so I was
>> curious what was driving the interest and whether the specific
>> framework is important to the community.
>>
>> On Tue, Oct 4, 2016 at 5:46 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
>>
>> > Hi Gwen,
>> >
>> >> I've also seen suggestions of using Jepsen for fault injection, but
>> >> I'm not familiar with this framework.
>> >>
>> >> What do you guys think? Write our own failure injection? or write
>> >> Kafka tests in Jepsen?
>> >>
>> >
>> > This would definitely add a lot of value and save a lot on release
>> > validation overheads. I have heard of Jepsen (via the blog), but haven't
>> > used it. At LinkedIn a couple of infra teams have been using Simoorg
>> > <https://github.com/linkedin/simoorg> which being python-based would
>> > perhaps be easier to use for system test writers than Clojure (under
>> > Jepsen). The Ambry <https://github.com/linkedin/ambry> project at LinkedIn
>> > uses it extensively (and I think has added several more failure scenarios
>> > which don't seem to be reflected in the github repo). Anyway, I think we
>> > should at least enumerate what we want to test and evaluate the
>> > alternatives before reinventing.
>> >
>> > Thanks,
>> >
>> > Joel
>>
>>
>>
>> --
>> Gwen Shapira
>> Product Manager | Confluent
>> 650.450.2760 | @gwenshap
>> Follow us: Twitter | blog
>>

--
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog