YES!

One of my goals for the fault-injection in our system tests is that
whoever fixes the issue will also add tests to make sure it stays
fixed.

On Wed, Oct 5, 2016 at 11:33 AM, Tom Crayford <tcrayf...@heroku.com> wrote:
> I did some stuff like this recently with simple calls to `tc` (samples that
> I used were in the README for https://github.com/tylertreat/comcast). The
> only notable bug I found so far is that if you cut all the Kafka nodes
> entirely off from ZooKeeper for, say, 60 seconds, then reconnect them, the
> nodes don't crash, they report as healthy in JMX, but calls to fetch
> metadata from them time out entirely. That can be fixed with a rolling
> restart, but it doesn't sound ideal (especially in the face of cloud
> networks, where short-lived total network outages can and do happen).
> Should I file a Jira detailing that bug?
>
> On Wed, Oct 5, 2016 at 7:26 PM, Gwen Shapira <g...@confluent.io> wrote:
>
>> Yeah, totally agree on discussing what we want to test first and
>> implementing anything later :)
>>
>> It's just that whenever I have this discussion Jepsen comes up, so I was
>> curious what was driving the interest and whether the specific
>> framework is important to the community.
>>
>> On Tue, Oct 4, 2016 at 5:46 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
>> > Hi Gwen,
>> >
>> >> I've also seen suggestions of using Jepsen for fault injection, but
>> >> I'm not familiar with this framework.
>> >>
>> >> What do you guys think? Write our own failure injection? Or write
>> >> Kafka tests in Jepsen?
>> >>
>> >
>> > This would definitely add a lot of value and save a lot on release
>> > validation overheads. I have heard of Jepsen (via the blog), but haven't
>> > used it. At LinkedIn a couple of infra teams have been using Simoorg
>> > <https://github.com/linkedin/simoorg>, which, being Python-based, would
>> > perhaps be easier to use for system test writers than Clojure (under
>> > Jepsen). The Ambry <https://github.com/linkedin/ambry> project at LinkedIn
>> > uses it extensively (and I think has added several more failure scenarios
>> > which don't seem to be reflected in the GitHub repo). Anyway, I think we
>> > should at least enumerate what we want to test and evaluate the
>> > alternatives before reinventing.
>> >
>> > Thanks,
>> >
>> > Joel
>>
>>
>>
>> --
>> Gwen Shapira
>> Product Manager | Confluent
>> 650.450.2760 | @gwenshap
>> Follow us: Twitter | blog
>>



-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog
