We're working on fault tolerance testing for our Kafka cluster. I'm trying to 
simulate a full data volume for the service and observe where and why it fails. 
To start, I ran fallocate -l <bytes> /data/big.file and then used df to confirm 
that 0 bytes remained available.
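
For reference, the fill step looked roughly like this -- a sketch rather than 
the exact invocation, with the file sized from df's available-bytes column 
instead of a hard-coded number:

# Allocate a file the size of the remaining free space on /data,
# then confirm that df reports 0 bytes available.
fallocate -l $(df --output=avail -B1 /data | tail -n 1) /data/big.file
df -B1 /data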

Nothing happened. I assumed this was because we'd had load running on this 
cluster for days and log.retention.hours=72 was too short -- data was likely 
being deleted as new data came in. So I updated this setting to -1 to disable 
time-based deletion and restarted the broker. Still, nothing happened (as in, 
nothing detrimental). To make doubly sure, I ran:

./kafka-configs.sh --zookeeper 10.0.0.1 --alter --entity-type topics \
  --entity-name __consumer_offsets --add-config 'retention.ms=8640000000000'
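
For completeness, the override can be confirmed with --describe against the 
same ZooKeeper host (a sketch -- I haven't pasted the output here):

./kafka-configs.sh --zookeeper 10.0.0.1 --describe --entity-type topics \
  --entity-name __consumer_offsets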

I applied this setting to all topics (sketched below). That value works out to 
100,000 days of retention, effectively infinite for this test. However, even 
after 12 hours of full load hitting our cluster, I see no evidence that the 
volume has filled up - Kafka is entirely healthy, at least as far as the 
metrics can tell me.
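
The "all topics" part was a loop over the topic list, roughly like this sketch 
(same ZooKeeper host; not the verbatim commands):

# Apply the same retention override to every topic in the cluster.
for topic in $(./kafka-topics.sh --zookeeper 10.0.0.1 --list); do
  ./kafka-configs.sh --zookeeper 10.0.0.1 --alter --entity-type topics \
    --entity-name "$topic" --add-config 'retention.ms=8640000000000'
done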

What am I missing? Are there other settings I should be modifying? I don't 
necessarily want to break the system for the sake of breaking it, but I do want 
to understand why it's healthy with a completely full data volume.

As for how I'm determining the system is healthy (a CLI spot-check for these is 
sketched after the list):

  *   ISR remains unchanged at 3 replicas (replication factor of 3)
  *   This broker is still present in the ISR
  *   This broker is also still the leader for several partitions
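
The spot-check is roughly this (the topic name is just an example):

# Per-partition Leader / Replicas / Isr for one topic
./kafka-topics.sh --zookeeper 10.0.0.1 --describe --topic __consumer_offsets

# Empty output here means no partition's ISR has shrunk below its replica set
./kafka-topics.sh --zookeeper 10.0.0.1 --describe --under-replicated-partitions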

Would love some feedback. Thanks!

----------------------------------------
Caleb Tote | Operations Engineer – Gameplay Services
