Hello Kafka experts,

I have a production cluster with three nodes (.100, .101, .102). I am using a C# producer to publish data to the Kafka brokers; it works for a while, but then it starts getting connection errors to two of the three nodes. Here is the C# producer error:
[2015-01-13 01:49:49,786] ERROR [ConsumerFetcherThread-console-consumer-52088_vagrant-ubuntu-trusty-64-1421113533029-20c40ebf-0-101], Error for partition [PofApiTest77,5] to broker 101:class kafka.common.NotLeaderForPartitionException (kafka.consumer.ConsumerFetcherThread)

To duplicate this issue, I ran a producer test on vagrant to send data, and this is what I get:

bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test-rep-three 50000000000 100 -1 acks=1 bootstrap.servers=10.100.50.100:9092,10.100.50.101:9092,10.100.50.102:9092 buffer.memory=67108864 batch.size=8196

. . .
536403 records sent, 107259.1 records/sec (10.23 MB/sec), 3993.0 ms avg latency, 11306.0 max latency.
[2015-01-13 17:49:44,055] WARN Error in I/O with harmful-jar.master/10.100.50.102 (org.apache.kafka.common.network.Selector)
java.io.EOFException
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:242)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
    at java.lang.Thread.run(Thread.java:745)
[2015-01-13 17:49:44,059] WARN Error in I/O with harmful-jar.master/10.100.50.102 (org.apache.kafka.common.network.Selector)
java.io.EOFException
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:242)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
    at java.lang.Thread.run(Thread.java:745)
[2015-01-13 17:52:38,384] WARN Error in I/O with voluminous-mass.master/10.100.50.101 (org.apache.kafka.common.network.Selector)
java.io.EOFException
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:242)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
    at java.lang.Thread.run(Thread.java:745)

It seems the connection was cut off.
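For reference, here is roughly what those same settings look like as a standalone producer using the Java client. This is only a minimal sketch for anyone who wants to reproduce it without the perf tool; the class name, serializers, record count, and callback handling are my assumptions, and my real producer is the C# client, not this code.

import java.util.Properties;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ConnectionDropRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Same settings as the ProducerPerformance run above.
        props.put("bootstrap.servers", "10.100.50.100:9092,10.100.50.101:9092,10.100.50.102:9092");
        props.put("acks", "1");
        props.put("buffer.memory", "67108864");
        props.put("batch.size", "8196");
        // Byte-array serializers are an assumption; the perf tool just sends raw bytes.
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
        byte[] payload = new byte[100]; // 100-byte records, same size as the perf run

        for (long i = 0; i < 1000000L; i++) {
            producer.send(new ProducerRecord<byte[], byte[]>("test-rep-three", payload),
                new Callback() {
                    public void onCompletion(RecordMetadata metadata, Exception e) {
                        if (e != null) {
                            // Send failures (e.g. broker disconnects) surface here.
                            e.printStackTrace();
                        }
                    }
                });
        }
        producer.close();
    }
}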
I tailed kafka/logs/state-change.log:

[2015-01-13 17:49:49,028] TRACE Broker 102 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:68,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,101,100) correlation id 7 from controller 101 epoch 1781 for partition [PofApiTest77,5] (state.change.logger)
[2015-01-13 17:49:49,030] TRACE Broker 102 handling LeaderAndIsr request correlationId 7 from controller 101 epoch 1781 starting the become-leader transition for partition [PofApiTest77,5] (state.change.logger)
[2015-01-13 17:49:49,032] TRACE Broker 102 stopped fetchers as part of become-leader request from controller 101 epoch 1781 with correlation id 7 for partition [PofApiTest77,5] (state.change.logger)
[2015-01-13 17:49:49,040] TRACE Broker 102 completed LeaderAndIsr request correlationId 7 from controller 101 epoch 1781 for the become-leader transition for partition [PofApiTest77,5] (state.change.logger)
[2015-01-13 17:49:49,042] TRACE Broker 102 cached leader info (LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:68,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,101,100) for partition [PofApiTest77,5] in response to UpdateMetadata request sent by controller 101 epoch 1781 with correlation id 7 (state.change.logger)
[2015-01-13 17:49:49,045] TRACE Broker 102 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:529,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,100,101) correlation id 8 from controller 101 epoch 1781 for partition [test-rep-three,5] (state.change.logger)
[2015-01-13 17:49:49,045] TRACE Broker 102 handling LeaderAndIsr request correlationId 8 from controller 101 epoch 1781 starting the become-leader transition for partition [test-rep-three,5] (state.change.logger)
[2015-01-13 17:49:49,048] TRACE Broker 102 stopped fetchers as part of become-leader request from controller 101 epoch 1781 with correlation id 8 for partition [test-rep-three,5] (state.change.logger)
[2015-01-13 17:49:49,049] TRACE Broker 102 completed LeaderAndIsr request correlationId 8 from controller 101 epoch 1781 for the become-leader transition for partition [test-rep-three,5] (state.change.logger)
[2015-01-13 17:49:49,051] TRACE Broker 102 cached leader info (LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:529,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,100,101) for partition [test-rep-three,5] in response to UpdateMetadata request sent by controller 101 epoch 1781 with correlation id 8 (state.change.logger)
[2015-01-13 17:49:49,053] TRACE Broker 102 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:528,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,101,100) correlation id 9 from controller 101 epoch 1781 for partition [test-rep-three,2] (state.change.logger)
[2015-01-13 17:49:49,053] TRACE Broker 102 handling LeaderAndIsr request correlationId 9 from controller 101 epoch 1781 starting the become-leader transition for partition [test-rep-three,2] (state.change.logger)
[2015-01-13 17:49:49,054] TRACE Broker 102 stopped fetchers as part of become-leader request from controller 101 epoch 1781 with correlation id 9 for partition [test-rep-three,2] (state.change.logger)
[2015-01-13 17:49:49,055] TRACE Broker 102 completed LeaderAndIsr request correlationId 9 from controller 101 epoch 1781 for the become-leader transition for partition [test-rep-three,2] (state.change.logger)
[2015-01-13 17:49:49,057] TRACE Broker 102 cached leader info (LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:528,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,101,100) for partition [test-rep-three,2] in response to UpdateMetadata request sent by controller 101 epoch 1781 with correlation id 9 (state.change.logger)
[2015-01-13 17:49:49,058] TRACE Broker 102 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:102,ISR:100,101,102,LeaderEpoch:68,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,100,101) correlation id 10 from controller 101 epoch 1781 for partition [PofApiTest77,2] (state.change.logger)
[2015-01-13 17:49:49,058] TRACE Broker 102 handling LeaderAndIsr request correlationId 10 from controller 101 epoch 1781 starting the become-leader transition for partition [PofApiTest77,2] (state.change.logger)
[2015-01-13 17:49:49,058] TRACE Broker 102 stopped fetchers as part of become-leader request from controller 101 epoch 1781 with correlation id 10 for partition [PofApiTest77,2] (state.change.logger)
[2015-01-13 17:49:49,059] TRACE Broker 102 completed LeaderAndIsr request correlationId 10 from controller 101 epoch 1781 for the become-leader transition for partition [PofApiTest77,2] (state.change.logger)
[2015-01-13 17:49:49,060] TRACE Broker 102 cached leader info (LeaderAndIsrInfo:(Leader:102,ISR:100,101,102,LeaderEpoch:68,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,100,101) for partition [PofApiTest77,2] in response to UpdateMetadata request sent by controller 101 epoch 1781 with correlation id 10 (state.change.logger)

Has anyone seen a similar issue where the network connection between nodes is lost?

Thanks,
-- Alec Li