Hi all,

I've been working on a fault injector for Apache Kafka.  The general
idea is to create faults such as network partitions or disk failures,
and see what happens in the cluster.  The fault injector can run as part
of a ducktape system test, or standalone.

The fault injector has two kinds of processes: a coordinator and an
agent.  The agent process is responsible for actually implementing the
faults.  For example, it might run iptables, send signals to processes,
generate a lot of load, or do something else to disrupt the computer it
is running on.  We run an agent process on each node where we might want
to inject faults, so agents run alongside the brokers, ZooKeeper nodes,
etc.

The coordinator process is responsible for communicating with the agent
processes and for scheduling faults.  For example, the coordinator can
be instructed to create a fault immediately on several nodes.  Or it can
be instructed to create faults over time, based on a pseudorandom seed. 
Both the coordinator and the agent expose a REST interface that accepts
objects serialized via JSON.
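
To make the seeded scheduling idea concrete, here is a rough sketch in
Python.  It is only an illustration, not the coordinator's actual code:
the function names, the JSON field names (node, type, startMs) and the
fault type strings are all made up for the example.  The point is just
that the same seed always yields the same sequence of faults, so a
failing run can be replayed.

```python
import json
import random


def build_schedule(seed, nodes, fault_types, count, max_delay_ms):
    """Return a reproducible list of fault specs derived from `seed`.

    All field names here are hypothetical, chosen for illustration.
    """
    # Use a private generator so the global random state is untouched.
    rng = random.Random(seed)
    schedule = []
    start_ms = 0
    for _ in range(count):
        # Each fault starts between 1 and max_delay_ms after the previous one.
        start_ms += rng.randint(1, max_delay_ms)
        schedule.append({
            "node": rng.choice(nodes),
            "type": rng.choice(fault_types),
            "startMs": start_ms,
        })
    return schedule


def to_json(schedule):
    """Serialize the schedule as JSON, e.g. for the body of a REST request."""
    return json.dumps(schedule, sort_keys=True)


if __name__ == "__main__":
    nodes = ["broker1", "broker2", "zk1"]
    fault_types = ["network_partition", "disk_error"]
    print(to_json(build_schedule(42, nodes, fault_types, 3, 60000)))
```

Calling build_schedule twice with the same arguments returns identical
lists, which is what makes a pseudorandom fault run repeatable.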

I think two kinds of faults will be especially interesting: network
faults, and disk errors.  Simulating network faults in a Linux
environment is relatively straightforward using iptables.  Disk errors
are tougher to simulate, but I have written a FUSE filesystem to do
this.  The filesystem essentially simulates a bind mount in most cases,
but it can take a JSON specification telling it to inject certain
faults.  (Disk errors seem especially relevant to the ongoing work on
JBOD.)
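
For the network case, a partition between one node and some of its
peers can be expressed as a handful of iptables DROP rules.  The sketch
below only builds the command lines and does not run them; the function
name and the choice of the INPUT and OUTPUT chains are assumptions made
for the example, not necessarily what the agent does.

```python
def partition_commands(peer_ips, undo=False):
    """Build iptables argument lists that cut this node off from peer_ips.

    With undo=True, build the matching commands that delete the rules.
    Hypothetical helper for illustration; it does not execute anything.
    """
    # -A appends a rule to a chain, -D deletes the matching rule.
    action = "-D" if undo else "-A"
    commands = []
    for ip in peer_ips:
        # Drop packets arriving from the peer ...
        commands.append(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"])
        # ... and packets sent to the peer.
        commands.append(["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"])
    return commands


if __name__ == "__main__":
    for cmd in partition_commands(["10.0.0.2", "10.0.0.3"]):
        print(" ".join(cmd))
```

Each argument list could be passed to subprocess.run as root to apply
the fault, and the undo=True variant removes the same rules afterwards.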

Although it's not a user-visible component, I think a fault injector
will be valuable for Kafka users, because it will let us stress test
Kafka under many more failure scenarios.  I'm going to post some patches
in a day or two, and it would be great to get some feedback.  Check out
https://cwiki.apache.org/confluence/display/KAFKA/Fault+Injection

best,
Colin
