Hey guys,

We decided to use Kafka in our new project, so I have spent some time researching how the Kafka producer behaves during network connectivity problems.
I had 3 virtual machines (Ubuntu 13.04, running on VirtualBox) in one network:

1. Kafka server (0.7.2) + Zookeeper.
2. Producer app with default settings.
3. Consumer app.

Results of the following tests with default sync producer settings:

1. Condition: Put the network down on machine (1) for 20 mins.
   Result: Producer keeps working for ~16 mins. Consumer does not receive anything. After ~16 mins the producer app fails (with java.io.IOException: Connection timed out). Consumer app does not fail. Messages that were generated during those 16 mins are lost!

2. Condition: Put the network down on machine (1) for 5 mins, then bring the network on (1) back up.
   Result: Producer app keeps working, with no exceptions or notification that the network was down. Consumer does not receive messages for 5 mins, but when the network on (1) is back up it receives all messages. No messages are lost.

3. Condition: Put the network down on machine (2) for 20 mins.
   Result: Producer keeps working for ~16 mins. Consumer does not receive anything. After ~16 mins the producer app fails (with java.io.IOException: Connection timed out). Consumer app does not fail. Messages that were generated during those 16 mins are lost! (Same result as in test #1.) The Kafka and Zookeeper logs show that the producer is disconnected.

4. Condition: Put the network down on machine (2) for 5 mins, then bring the network on (2) back up.
   Result: Producer app keeps working, with no exceptions or notification that the network was down. Consumer does not receive messages for 5 mins, but when the network on (2) is back up it receives all messages. (Same result as in test #2.) The Kafka and Zookeeper logs show that the producer is disconnected.

5. Condition: Kill the Kafka server (0.7.2) + Zookeeper (kill the applications, do not shut down the network).
   Result: Producer fails within a few seconds with "kafka.common.NoBrokersForPartitionException: Partition = null". Consumer is still running even after 25 minutes.

One more interesting thing: changing the connect.timeout.ms parameter value for the producer did not change the 16 mins that I get.
I played with the settings and found that the only way to reduce the time it takes the producer to notice that the network is down is to change one of two parameters: reconnect.interval or reconnect.time.interval.ms.

So let's say we set reconnect.time.interval.ms=1000. This means that the producer will reconnect to Kafka every 1 second. In this case the producer finds out that the network is down in 1 second. It stops sending messages and throws "java.net.ConnectException: Connection timed out". This is the only way that I have found so far. In this case we do not lose too many messages, but performance may suffer. Or we can set reconnect.interval=1, so that a reconnect happens after each message sent and we do not lose messages at all.

Then I tested the async producer (producer.type=async). The results are dramatic for me, because the producer does not throw any exception. It keeps sending messages and does not fail. I left it running overnight and it did not fail, even though the network between the Kafka server and the producer app was down. Playing with the async producer config parameters did not help either.

My questions are:

1. Where might these 16 mins come from?
2. Are there any best practices for handling network-down issues?
3. Why does the async producer never throw exceptions when the network is down?
4. How can I check from the sync/async producer that messages were really sent?
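
For reference, here is a minimal sketch of the two reconnect variants described above, written as plain java.util.Properties so it runs without Kafka on the classpath. The property names and values are the ones mentioned in this post. The class and method names are made up for illustration, and actually handing these properties to the Kafka producer is assumed rather than shown.

```java
import java.util.Properties;

// Hypothetical helper class, not part of Kafka: it only builds the
// producer settings discussed in this post.
public class ProducerSettingsSketch {

    // Settings shared by both variants. "sync" is the producer mode used
    // in tests 1-5; "async" is the mode tested afterwards.
    static Properties baseProps(String producerType) {
        Properties props = new Properties();
        props.setProperty("producer.type", producerType);
        return props;
    }

    // Variant A: time-based reconnect, once every second.
    static Properties reconnectEverySecond() {
        Properties props = baseProps("sync");
        props.setProperty("reconnect.time.interval.ms", "1000");
        return props;
    }

    // Variant B: count-based reconnect, after each message sent.
    static Properties reconnectAfterEachMessage() {
        Properties props = baseProps("sync");
        props.setProperty("reconnect.interval", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties a = reconnectEverySecond();
        Properties b = reconnectAfterEachMessage();
        System.out.println("A: reconnect.time.interval.ms="
                + a.getProperty("reconnect.time.interval.ms"));
        System.out.println("B: reconnect.interval="
                + b.getProperty("reconnect.interval"));
    }
}
```

Variant A trades a reconnect every second for noticing the outage quickly. Variant B reconnects on every send, which is the setting that lost no messages in my tests but is likely the more expensive of the two.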