We're working on fault tolerance testing for our Kafka cluster. I'm trying to simulate a full data volume for the service and observe where and why it fails. To start, I ran fallocate -l <bytes> /data/big.file and then used df to confirm that 0 bytes remained available.
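Concretely, the fill step looked roughly like this (a sketch rather than my exact commands: it assumes /data is the Kafka data mount, matching the file path above, and sizes the file from df's reported free space):

    # Reserve whatever space is still free on the data mount (assumed to be /data)
    AVAIL_BYTES=$(df --output=avail -B1 /data | tail -n 1)
    fallocate -l "${AVAIL_BYTES}" /data/big.file

    # Confirm the volume now reports 0 bytes available
    df -B1 /data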
Nothing happened. I assumed that was because we'd had load running on this cluster for days and log.retention.hours=72 was too short: data was likely being deleted as fast as new data came in. So I set that to -1 to disable time-based deletion and restarted this broker. Still nothing detrimental happened. To make doubly sure, I also added a per-topic override:

    ./kafka-configs.sh --zookeeper 10.0.0.1 --alter --entity-type topics --entity-name __consumer_offsets --add-config 'retention.ms=8640000000000'

I applied this setting to every topic (the per-topic loop is sketched in the P.S. below). That value works out to roughly 100,000 days of retention, so nothing should be getting deleted for the duration of this test.

However, even after 12 hours of full load hitting the cluster, I see no evidence that the full volume is causing any problems: Kafka is entirely healthy, at least as far as the metrics can show me. What am I missing? Are there other settings I should be modifying? I don't want to break the system purely for the sake of breaking it, but I do want to understand why it's healthy with a completely full data volume.

As for how I'm determining the system is healthy (the describe command I'm using is also in the P.S.):

* The ISR is unchanged: all 3 replicas in sync (replication factor of 3)
* This broker is still present in the ISR
* This broker is still the leader for several partitions

Would love some feedback. Thanks!

----------------------------------------
Caleb Tote | Operations Engineer – Gameplay Services
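
P.S. For completeness, here is roughly how I rolled the retention.ms override out to every topic, and how I'm checking ISR/leadership. These are illustrative sketches rather than my exact shell history; they assume the ZooKeeper-based tooling we're on and the same 10.0.0.1 ZooKeeper address as above.

    # Apply the long retention.ms override to every topic (the loop itself is illustrative)
    for topic in $(./kafka-topics.sh --zookeeper 10.0.0.1 --list); do
      ./kafka-configs.sh --zookeeper 10.0.0.1 --alter \
        --entity-type topics --entity-name "${topic}" \
        --add-config 'retention.ms=8640000000000'
    done

    # Per-partition Leader / Replicas / Isr view: this is where I'm reading
    # "all 3 replicas in the ISR, this broker present and still leading"
    ./kafka-topics.sh --zookeeper 10.0.0.1 --describe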