Evan Huus created KAFKA-2082:
--------------------------------

             Summary: Kafka Replication ends up in a bad state
                 Key: KAFKA-2082
                 URL: https://issues.apache.org/jira/browse/KAFKA-2082
             Project: Kafka
          Issue Type: Bug
          Components: replication
    Affects Versions: 0.8.2.1
            Reporter: Evan Huus
            Assignee: Neha Narkhede
            Priority: Critical

While running integration tests for Sarama (the Go client) we came across a pattern of connection losses that reliably puts Kafka into a bad state: several of the brokers start spinning, chewing ~30% CPU and spamming the logs with hundreds of thousands of lines like:

{noformat}
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
[2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
{noformat}

This can be easily and reliably reproduced using the {{toxiproxy-final}} branch of https://github.com/Shopify/sarama which includes a vagrant script for provisioning the appropriate cluster:

- {{git clone https://github.com/Shopify/sarama.git}}
- {{cd sarama}}
- {{git checkout toxiproxy-final}}
- {{vagrant up}}
- {{TEST_SEED=1427917826425719059 DEBUG=true go test -v}}

After the test finishes (it fails because the cluster ends up in a bad state), you can log into the cluster machine with {{vagrant ssh}} and inspect the bad nodes. The vagrant script provisions five brokers in {{/opt/kafka-9091/}} through {{/opt/kafka-9095/}}.
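
When inspecting a bad node, it may help to see which partitions the warnings concentrate on. The following is a minimal sketch in Python, not part of the original reproduction: it assumes the warnings keep the exact wording quoted above, and the names {{WARNING_RE}}, {{count_leader_not_local}} and {{sample}} are hypothetical helpers introduced only for illustration. The sample lines are copied from the log excerpt in this report.

```python
import re
from collections import Counter

# Matches the ReplicaManager warning quoted in this report.
# Assumption: the message wording is exactly as in the excerpt above.
WARNING_RE = re.compile(
    r"Fetch request with correlation id (?P<corr>\d+) "
    r"from client (?P<client>\S+) "
    r"on partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"failed due to Leader not local for partition"
)


def count_leader_not_local(lines):
    """Tally 'Leader not local' fetch warnings per (topic, partition)."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            key = (match.group("topic"), int(match.group("partition")))
            counts[key] += 1
    return counts


# Three lines taken verbatim from the excerpt in this report.
sample = [
    "[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: "
    "Fetch request with correlation id 111094 from client "
    "ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due "
    "to Leader not local for partition [many_partition,1] on broker 9093 "
    "(kafka.server.ReplicaManager)",
    "[2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: "
    "Fetch request with correlation id 111094 from client "
    "ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due "
    "to Leader not local for partition [many_partition,6] on broker 9093 "
    "(kafka.server.ReplicaManager)",
    "[2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: "
    "Fetch request with correlation id 111095 from client "
    "ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due "
    "to Leader not local for partition [many_partition,1] on broker 9093 "
    "(kafka.server.ReplicaManager)",
]

if __name__ == "__main__":
    # Print partitions ordered by how often the warning was logged.
    for (topic, partition), n in count_leader_not_local(sample).most_common():
        print(f"{topic},{partition}: {n}")
```

To run it against a real broker log, pass an open file object instead of {{sample}}; the location of the log file under each broker directory is not stated in this report, so it has to be looked up on the provisioned machine.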