Any errors in the controller/state-change logs? Do you see GC-induced ZK session expirations in the brokers?
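One quick way to check both (a sketch; the log paths below are placeholders for wherever your log4j appenders write, and the GC flags are standard HotSpot options for Java 7-era JVMs, not Kafka-specific):

  # look for ZK session expirations around 15:31 in the broker logs
  grep -i "expired" /path/to/kafka/logs/server.log /path/to/kafka/logs/controller.log

  # enable GC logging for the broker JVM (KAFKA_OPTS is picked up by kafka-run-class.sh)
  export KAFKA_OPTS="-verbose:gc -Xloggc:/path/to/kafka/logs/gc.log \
    -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime"

A stop-the-world pause in the GC log longer than zookeeper.session.timeout.ms (6000 ms by default) would point to a GC-induced session expiration.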
Thanks,

Jun


On Wed, Apr 16, 2014 at 9:50 AM, Xinyao Hu <xinya...@dropbox.com> wrote:

> Hey all,
>
> This is Xinyao from Dropbox, and I am trying to push Kafka to be widely
> used inside Dropbox.
>
> I set up five nodes with 2x replication. The system is running well and
> handling all the traffic. The problem I am seeing now is that one host
> (probably a leader) is being kicked out of the ISR. Here is some related
> information:
>
> [2014-04-16 15:31:09,596] INFO Closing socket connection to /10.12.99.184. (kafka.network.Processor)
> [2014-04-16 15:31:09,921] INFO Closing socket connection to /10.12.99.184. (kafka.network.Processor)
> [2014-04-16 15:31:13,142] INFO Closing socket connection to /10.12.99.153. (kafka.network.Processor)
> [2014-04-16 15:31:24,101] INFO Closing socket connection to /10.12.62.181. (kafka.network.Processor)
> [2014-04-16 15:31:27,157] INFO Rolled new log segment for 'photos-mobile_request_user0_logs-0003-4' in 1 ms. (kafka.log.Log)
> [2014-04-16 15:31:30,709] INFO Rolled new log segment for 'api-oauth-unnecessary-token-with-nonempty-secret-3' in 1 ms. (kafka.log.Log)
> [2014-04-16 15:31:32,310] INFO Rolled new log segment for 'api-files-post-unnecessary-params-3' in 1 ms. (kafka.log.Log)
> [2014-04-16 15:31:42,150] INFO Closing socket connection to /10.12.99.184. (kafka.network.Processor)
> [2014-04-16 15:31:42,151] INFO Closing socket connection to /10.12.99.184. (kafka.network.Processor)
> [2014-04-16 15:31:47,557] INFO Closing socket connection to /10.16.105.60. (kafka.network.Processor)
> [2014-04-16 15:31:47,571] INFO Closing socket connection to /10.16.103.58. (kafka.network.Processor)
> [2014-04-16 15:31:47,632] INFO Closing socket connection to /10.16.54.38. (kafka.network.Processor)
> [2014-04-16 15:31:47,879] INFO Closing socket connection to /10.16.102.22. (kafka.network.Processor)
> [2014-04-16 15:31:54,954] INFO Rolled new log segment for 'api-not_your_cursor-2' in 1 ms. (kafka.log.Log)
> [2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160. (kafka.network.Processor)
> [2014-04-16 15:31:56,353] INFO Closing socket connection to /10.12.98.160. (kafka.network.Processor)
> [2014-04-16 15:31:57,136] INFO Closing socket connection to /10.12.62.181. (kafka.network.Processor)
> [2014-04-16 15:31:57,137] INFO Closing socket connection to /10.12.62.181. (kafka.network.Processor)
> [2014-04-16 15:32:05,540] INFO Partition [client-fsinfo-from-formatted-traces-0012,0] on broker 1: Shrinking ISR for partition [client-fsinfo-from-formatted-traces-0012,0] from 1,2 to 1 (kafka.cluster.Partition)
> [2014-04-16 15:32:05,579] ERROR Conditional update of path /brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state with data {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} and expected version 0 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state (kafka.utils.ZkUtils$)
> [2014-04-16 15:32:05,581] INFO Partition [client-fsinfo-from-formatted-traces-0012,0] on broker 1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-04-16 15:32:05,582] INFO Partition [api-shmodel-non-ssl,4] on broker 1: Shrinking ISR for partition [api-shmodel-non-ssl,4] from 1,2 to 1 (kafka.cluster.Partition)
> [2014-04-16 15:32:05,596] ERROR Conditional update of path /brokers/topics/api-shmodel-non-ssl/partitions/4/state with data {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} and expected version 0 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/api-shmodel-non-ssl/partitions/4/state (kafka.utils.ZkUtils$)
> [2014-04-16 15:32:05,596] INFO Partition [api-shmodel-non-ssl,4] on broker 1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> [2014-04-16 15:32:05,596] INFO Partition [server-network_logs_errors-0012,0] on broker 1: Shrinking ISR for partition [server-network_logs_errors-0012,0] from 1,2 to 1 (kafka.cluster.Partition)
> [2014-04-16 15:32:05,605] ERROR Conditional update of path /brokers/topics/server-network_logs_errors-0012/partitions/0/state with data {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1]} and expected version 0 failed due to org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /brokers/topics/server-network_logs_errors-0012/partitions/0/state (kafka.utils.ZkUtils$)
> [2014-04-16 15:32:05,605] INFO Partition [server-network_logs_errors-0012,0] on broker 1: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>
> You can see that something went wrong around 15:31.
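> For reference, the conditional update in those ERROR lines is a
> compare-and-set against the partition state znode: the leader only writes
> the shrunken ISR if the znode's version still matches its cached zkVersion,
> and BadVersion means something (presumably the controller, after moving the
> leader) has already bumped it. The znode's current version can be inspected
> with the zookeeper-shell tool that ships with Kafka (the connect string
> below is a placeholder for your ensemble):
>
>   bin/zookeeper-shell.sh zkhost:2181
>   get /brokers/topics/client-fsinfo-from-formatted-traces-0012/partitions/0/state
>
> The 'get' output prints the state JSON followed by the znode stat, whose
> dataVersion can be compared with the cached zkVersion [0] in the logs above.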
> Here is the current topic replica information:
>
> Topic:analytics-ab_raw_metric  PartitionCount:5  ReplicationFactor:2  Configs:
>   Topic: analytics-ab_raw_metric  Partition: 0  Leader: 3  Replicas: 3,5  Isr: 3,5
>   Topic: analytics-ab_raw_metric  Partition: 1  Leader: 4  Replicas: 4,1  Isr: 1,4
>   Topic: analytics-ab_raw_metric  Partition: 2  Leader: 5  Replicas: 5,2  Isr: 5,2
>   Topic: analytics-ab_raw_metric  Partition: 3  Leader: 3  Replicas: 1,3  Isr: 3
>   Topic: analytics-ab_raw_metric  Partition: 4  Leader: 2  Replicas: 2,4  Isr: 2,4
> Topic:analytics-api_log2-0000  PartitionCount:5  ReplicationFactor:2  Configs:
>   Topic: analytics-api_log2-0000  Partition: 0  Leader: 3  Replicas: 3,5  Isr: 3,5
>   Topic: analytics-api_log2-0000  Partition: 1  Leader: 4  Replicas: 4,1  Isr: 1,4
>   Topic: analytics-api_log2-0000  Partition: 2  Leader: 5  Replicas: 5,2  Isr: 5,2
>   Topic: analytics-api_log2-0000  Partition: 3  Leader: 3  Replicas: 1,3  Isr: 3
>   Topic: analytics-api_log2-0000  Partition: 4  Leader: 2  Replicas: 2,4  Isr: 2,4
> Topic:analytics-api_log2-0001  PartitionCount:5  ReplicationFactor:2  Configs:
>   Topic: analytics-api_log2-0001  Partition: 0  Leader: 3  Replicas: 1,3  Isr: 3
>   Topic: analytics-api_log2-0001  Partition: 1  Leader: 2  Replicas: 2,4  Isr: 2,4
>   Topic: analytics-api_log2-0001  Partition: 2  Leader: 3  Replicas: 3,5  Isr: 3,5
>   Topic: analytics-api_log2-0001  Partition: 3  Leader: 4  Replicas: 4,1  Isr: 1,4
>   Topic: analytics-api_log2-0001  Partition: 4  Leader: 5  Replicas: 5,2  Isr: 5,2
> Topic:analytics-api_log2-0002  PartitionCount:5  ReplicationFactor:2  Configs:
>   Topic: analytics-api_log2-0002  Partition: 0  Leader: 5  Replicas: 5,1  Isr: 5,1
>   Topic: analytics-api_log2-0002  Partition: 1  Leader: 2  Replicas: 1,2  Isr: 2
>   Topic: analytics-api_log2-0002  Partition: 2  Leader: 2  Replicas: 2,3  Isr: 2,3
>   Topic: analytics-api_log2-0002  Partition: 3  Leader: 3  Replicas: 3,4  Isr: 3,4
>   Topic: analytics-api_log2-0002  Partition: 4  Leader: 4  Replicas: 4,5  Isr: 5,4
> Topic:analytics-api_log2-0003  PartitionCount:5  ReplicationFactor:2  Configs:
>   Topic: analytics-api_log2-0003  Partition: 0  Leader: 5  Replicas: 5,4  Isr: 5,4
>   Topic: analytics-api_log2-0003  Partition: 1  Leader: 5  Replicas: 1,5  Isr: 5
>   Topic: analytics-api_log2-0003  Partition: 2  Leader: 2  Replicas: 2,1  Isr: 2,1
>   Topic: analytics-api_log2-0003  Partition: 3  Leader: 3  Replicas: 3,2  Isr: 3,2
>   Topic: analytics-api_log2-0003  Partition: 4  Leader: 4  Replicas: 4,3  Isr: 3,4
>
> ....
>
> I suspect that broker 1 lost leadership for all the topics around that
> timestamp, but it is still in sync as a follower.
>
> I followed "How to reduce churn" and increased replica.lag.max.messages.
> However, I don't think it will help, since I suspect it is the leader
> that was kicked out of the ISR, not a lagging follower.
>
> I cannot consistently reproduce this; it has happened twice, once per day
> over the past two days. These are all the configs I changed in
> server.properties:
>
> default.replication.factor=2
> auto.leader.rebalance.enable=true
> log.roll.hours=1
> replica.lag.max.messages=50000
>
> Let me know if you need more logs to debug.
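> P.S. To pick out just the partitions whose ISR is missing a replica,
> something like this should work (kafka-topics.sh is the tool that produced
> the listing above; the --under-replicated-partitions describe option may
> not exist in every version, and the ZK connect string is a placeholder):
>
>   bin/kafka-topics.sh --zookeeper zkhost:2181 --describe --under-replicated-partitions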