Hi all,

I've been working on a fault injector for Apache Kafka. The general idea is to create faults such as network partitions or disk failures, and see what happens in the cluster. The fault injector can run as part of a ducktape system test, or standalone.
The fault injector has two processes: a coordinator and an agent. The agent process is responsible for actually implementing the faults. For example, it might run iptables, send signals to processes, generate a lot of load, or do something else to disrupt the computer it is running on. We run an agent process on each node where we might want to inject faults, so it will run alongside the brokers, ZooKeeper nodes, etc.

The coordinator process is responsible for communicating with the agent processes and for scheduling faults. For example, the coordinator can be instructed to create a fault immediately on several nodes. Or it can be instructed to create faults over time, based on a pseudorandom seed. Both the coordinator and the agent expose a REST interface that accepts objects serialized via JSON.

I think two kinds of faults will be especially interesting: network faults and disk errors. Simulating network faults in a Linux environment is relatively straightforward using iptables. Disk errors are tougher to simulate, but I have written a FUSE filesystem to do this. The filesystem essentially simulates a bind mount in most cases, but it can take a JSON specification telling it to inject certain faults. (Disk errors seem especially relevant to the ongoing work on JBOD.)

Although it's not a user-visible component, I think having a fault injector will be really great for Kafka users. It will help us stress test Kafka in more situations.

I'm going to post some patches in a day or two. It would be great to get some feedback.

Check out https://cwiki.apache.org/confluence/display/KAFKA/Fault+Injection

best,
Colin
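
As a rough illustration of the "create faults over time, based on a pseudorandom seed" idea described above, here is a minimal Python sketch. It is not taken from the patches. The names ScheduledFault, make_schedule and schedule_to_json, the fault kind strings, and the JSON field names are all hypothetical, and the real coordinator may work quite differently. The only point is that a fixed seed yields the same fault schedule on every run, so a failure found by a randomized run can be replayed.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScheduledFault:
    """One planned fault (hypothetical shape, for illustration only)."""
    start_ms: int      # offset from the start of the run
    duration_ms: int   # how long the fault stays active
    node: str          # node whose agent should inject the fault
    kind: str          # e.g. "network_partition" or "disk_error"


def make_schedule(seed, nodes, kinds, count, horizon_ms, max_duration_ms):
    """Build a fault schedule that depends only on the arguments.

    A private random.Random(seed) instance is used so the result does not
    depend on, or disturb, the global random state.
    """
    rng = random.Random(seed)
    faults = []
    for _ in range(count):
        faults.append(ScheduledFault(
            start_ms=rng.randrange(horizon_ms),
            duration_ms=rng.randrange(1, max_duration_ms + 1),
            node=rng.choice(nodes),
            kind=rng.choice(kinds),
        ))
    # Order by start time so a coordinator could walk the list in sequence.
    faults.sort(key=lambda f: f.start_ms)
    return faults


def schedule_to_json(faults):
    """Serialize the schedule as a JSON array of objects."""
    return json.dumps([asdict(f) for f in faults])


if __name__ == "__main__":
    schedule = make_schedule(
        seed=42,
        nodes=["broker1", "broker2", "zk1"],
        kinds=["network_partition", "disk_error"],
        count=5,
        horizon_ms=60_000,
        max_duration_ms=5_000,
    )
    print(schedule_to_json(schedule))
```

Calling make_schedule twice with the same seed and arguments returns identical lists, while changing the seed gives a different schedule. The JSON output is only meant to suggest the kind of object that could be sent over a REST interface, not the actual wire format.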