Hey all,

This is Xinyao from Dropbox, and I am trying to get Kafka widely used inside Dropbox.
I set up five nodes with 2x replication, and the system has been handling all the traffic well. The problem I am seeing now is that one host (probably a leader) is being kicked out of the ISR. Here is some related information:

[2014-04-16 15:31:09,596] INFO Closing socket connection to /10.12.99.184. (kafka.network.Processor)
[2014-04-16 15:31:09,921] INFO Closing socket connection to /10.12.99.184. (kafka.network.Processor)
[2014-04-16 15:31:13,142] INFO Closing socket connection to /10.12.99.153. (kafka.network.Processor)
[2014-04-16 15:31:24,101] INFO Closing socket connection to /10.12.62.181. (kafka.network.Processor)
[2014-04-16 15:31:27,157] INFO Rolled new log segment for 'photos-mobile_request_user0_logs-0003-4' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:30,709] INFO Rolled new log segment for 'api-oauth-unnecessary-token-with-nonempty-secret-3' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:32,310] INFO Rolled new log segment for 'api-files-post-unnecessary-params-3' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:42,150] INFO Closing socket connection to /10.12.99.184. (kafka.network.Processor)
[2014-04-16 15:31:42,151] INFO Closing socket connection to /10.12.99.184. (kafka.network.Processor)
[2014-04-16 15:31:47,557] INFO Closing socket connection to /10.16.105.60. (kafka.network.Processor)
[2014-04-16 15:31:47,571] INFO Closing socket connection to /10.16.103.58. (kafka.network.Processor)
[2014-04-16 15:31:47,632] INFO Closing socket connection to /10.16.54.38. (kafka.network.Processor)
[2014-04-16 15:31:47,879] INFO Closing socket connection to /10.16.102.22. (kafka.network.Processor)
[2014-04-16 15:31:54,954] INFO Rolled new log segment for 'api-not_your_cursor-2' in 1 ms. (kafka.log.Log)
[2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160. (kafka.network.Processor)
[2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160. (kafka.network.Processor)
[2014-04-16 15:31:57,136] INFO Closing socket connection to /10.12.62.181. (kafka.network.Processor)
[2014-04-16 15:31:57,137] INFO Closing socket connection to /10.12.62.181. (kafka.network.Processor)
[2014-04-16 15:32:05,540] INFO Partition [client-fsinfo-from-formatted-traces-0012,0] on broker 1: Shrinking ISR for partition [client-fsinfo-from-formatted-traces-0012,0] from 1,2 to 1 (kafka.cluster.Partition)
[2014-04-16 15:32:05,579] ERROR Conditional update of path /brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state with data {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} and expected version 0 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state (kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,581] INFO Partition [client-fsinfo-from-formatted-traces-0012,0] on broker 1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-04-16 15:32:05,582] INFO Partition [api-shmodel-non-ssl,4] on broker 1: Shrinking ISR for partition [api-shmodel-non-ssl,4] from 1,2 to 1 (kafka.cluster.Partition)
[2014-04-16 15:32:05,596] ERROR Conditional update of path /brokers/topics/api-shmodel-non-ssl/partitions/4/state with data {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} and expected version 0 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/api-shmodel-non-ssl/partitions/4/state (kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,596] INFO Partition [api-shmodel-non-ssl,4] on broker 1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2014-04-16 15:32:05,596] INFO Partition [server-network_logs_errors-0012,0] on broker 1: Shrinking ISR for partition [server-network_logs_errors-0012,0] from 1,2 to 1 (kafka.cluster.Partition)
[2014-04-16 15:32:05,605] ERROR Conditional update of path /brokers/topics/server-network_logs_errors-0012/partitions/0/state with data {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} and expected version 0 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/server-network_logs_errors-0012/partitions/0/state (kafka.utils.ZkUtils$)
[2014-04-16 15:32:05,605] INFO Partition [server-network_logs_errors-0012,0] on broker 1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

You can see that something went wrong around 15:31.
This is the current topic replica information:

Topic:analytics-ab_raw_metric  PartitionCount:5  ReplicationFactor:2  Configs:
    Topic: analytics-ab_raw_metric  Partition: 0  Leader: 3  Replicas: 3,5  Isr: 3,5
    Topic: analytics-ab_raw_metric  Partition: 1  Leader: 4  Replicas: 4,1  Isr: 1,4
    Topic: analytics-ab_raw_metric  Partition: 2  Leader: 5  Replicas: 5,2  Isr: 5,2
    Topic: analytics-ab_raw_metric  Partition: 3  Leader: 3  Replicas: 1,3  Isr: 3
    Topic: analytics-ab_raw_metric  Partition: 4  Leader: 2  Replicas: 2,4  Isr: 2,4
Topic:analytics-api_log2-0000  PartitionCount:5  ReplicationFactor:2  Configs:
    Topic: analytics-api_log2-0000  Partition: 0  Leader: 3  Replicas: 3,5  Isr: 3,5
    Topic: analytics-api_log2-0000  Partition: 1  Leader: 4  Replicas: 4,1  Isr: 1,4
    Topic: analytics-api_log2-0000  Partition: 2  Leader: 5  Replicas: 5,2  Isr: 5,2
    Topic: analytics-api_log2-0000  Partition: 3  Leader: 3  Replicas: 1,3  Isr: 3
    Topic: analytics-api_log2-0000  Partition: 4  Leader: 2  Replicas: 2,4  Isr: 2,4
Topic:analytics-api_log2-0001  PartitionCount:5  ReplicationFactor:2  Configs:
    Topic: analytics-api_log2-0001  Partition: 0  Leader: 3  Replicas: 1,3  Isr: 3
    Topic: analytics-api_log2-0001  Partition: 1  Leader: 2  Replicas: 2,4  Isr: 2,4
    Topic: analytics-api_log2-0001  Partition: 2  Leader: 3  Replicas: 3,5  Isr: 3,5
    Topic: analytics-api_log2-0001  Partition: 3  Leader: 4  Replicas: 4,1  Isr: 1,4
    Topic: analytics-api_log2-0001  Partition: 4  Leader: 5  Replicas: 5,2  Isr: 5,2
Topic:analytics-api_log2-0002  PartitionCount:5  ReplicationFactor:2  Configs:
    Topic: analytics-api_log2-0002  Partition: 0  Leader: 5  Replicas: 5,1  Isr: 5,1
    Topic: analytics-api_log2-0002  Partition: 1  Leader: 2  Replicas: 1,2  Isr: 2
    Topic: analytics-api_log2-0002  Partition: 2  Leader: 2  Replicas: 2,3  Isr: 2,3
    Topic: analytics-api_log2-0002  Partition: 3  Leader: 3  Replicas: 3,4  Isr: 3,4
    Topic: analytics-api_log2-0002  Partition: 4  Leader: 4  Replicas: 4,5  Isr: 5,4
Topic:analytics-api_log2-0003  PartitionCount:5  ReplicationFactor:2  Configs:
    Topic: analytics-api_log2-0003  Partition: 0  Leader: 5  Replicas: 5,4  Isr: 5,4
    Topic: analytics-api_log2-0003  Partition: 1  Leader: 5  Replicas: 1,5  Isr: 5
    Topic: analytics-api_log2-0003  Partition: 2  Leader: 2  Replicas: 2,1  Isr: 2,1
    Topic: analytics-api_log2-0003  Partition: 3  Leader: 3  Replicas: 3,2  Isr: 3,2
    Topic: analytics-api_log2-0003  Partition: 4  Leader: 4  Replicas: 4,3  Isr: 3,4
....

I suspect that broker 1 lost leadership for all the topics around that timestamp, but it is still in sync as a follower. I followed the "How to reduce churn" advice and increased replica.lag.max.messages, but I don't think that helps, since I suspect it is the leader that is being kicked out of the ISR. I cannot reproduce this consistently; it has happened twice, about once a day over the past two days.

These are all the configs I changed in server.properties:

default.replication.factor=2
auto.leader.rebalance.enable=true
log.roll.hours=1
replica.lag.max.messages=50000

Let me know if you need more logs to debug this.
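P.S. In case it helps: the replica information above is the output of kafka-topics.sh --describe. If it would be useful, I can also pull just the partitions whose ISR has shrunk with something like the following (the ZooKeeper connection string here is a placeholder for ours):

    bin/kafka-topics.sh --zookeeper zk1.example.com:2181 --describe --under-replicated-partitions

That should list only the partitions where the ISR is currently smaller than the replica set.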