[ https://issues.apache.org/jira/browse/KAFKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391699#comment-14391699 ]
Sriharsha Chintalapani commented on KAFKA-2082:
-----------------------------------------------

[~eapache] I'm having a hard time getting all the brokers up using vagrant up. The Vagrantfile currently sets v.vmx["memsize"] = "3072", which seems too low for a five-broker cluster plus the zookeepers. If I increase it to a higher value I can get the cluster up, but at least one broker is still down before I even start the test. Running the test does not fail for me, though:

{noformat}
--- PASS: TestSyncProducer (0.00s)
=== RUN TestConcurrentSyncProducer
[sarama] 2015/04/01 16:01:51 Initializing new client
[sarama] 2015/04/01 16:01:51 Fetching metadata for all topics from broker 127.0.0.1:49466
[sarama] 2015/04/01 16:01:51 Connected to broker 127.0.0.1:49466
[sarama] 2015/04/01 16:01:51 Registered new broker #2 at 127.0.0.1:49467
[sarama] 2015/04/01 16:01:51 Successfully initialized new client
[sarama] 2015/04/01 16:01:51 producer/flusher/2 starting up
[sarama] 2015/04/01 16:01:51 Connected to broker 127.0.0.1:49467
[sarama] 2015/04/01 16:01:51 Producer shutting down.
[sarama] 2015/04/01 16:01:51 producer/flusher/2 shut down
[sarama] 2015/04/01 16:01:51 Closing Client
[sarama] 2015/04/01 16:01:51 Closed connection to broker 127.0.0.1:49467
[sarama] 2015/04/01 16:01:51 Closed connection to broker 127.0.0.1:49466
--- PASS: TestConcurrentSyncProducer (0.00s)
=== RUN TestSyncProducerToNonExistingTopic
[sarama] 2015/04/01 16:01:51 Initializing new client
[sarama] 2015/04/01 16:01:51 Fetching metadata for all topics from broker 127.0.0.1:49470
[sarama] 2015/04/01 16:01:51 Connected to broker 127.0.0.1:49470
[sarama] 2015/04/01 16:01:51 Registered new broker #1 at 127.0.0.1:49470
[sarama] 2015/04/01 16:01:51 Successfully initialized new client
[sarama] 2015/04/01 16:01:51 Fetching metadata for [unknown] from broker 127.0.0.1:49470
[sarama] 2015/04/01 16:01:51 Some partitions are leaderless, but we're out of retries
[sarama] 2015/04/01 16:01:51 Producer shutting down.
[sarama] 2015/04/01 16:01:51 Closing Client
[sarama] 2015/04/01 16:01:51 Closed connection to broker 127.0.0.1:49470
--- PASS: TestSyncProducerToNonExistingTopic (0.00s)
PASS
ok  	_/Users/schintalapani/code/sarama	44.946s
{noformat}

Does the above test fail intermittently, or are you seeing it fail consistently?
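For reference, a minimal sketch of the round trip the sync-producer tests above exercise, assuming the current public sarama API; the broker address and topic name are illustrative, and the {{toxiproxy-final}} branch may differ in detail:

{code:go}
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // required so SendMessage can report partition/offset

	// Connect to one of the local vagrant brokers (address is illustrative).
	producer, err := sarama.NewSyncProducer([]string{"127.0.0.1:9091"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Produce a single message and wait for the broker's acknowledgement.
	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "test.1", // illustrative topic name
		Value: sarama.StringEncoder("hello"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("delivered to partition %d at offset %d", partition, offset)
}
{code}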
> Kafka Replication ends up in a bad state
> ----------------------------------------
>
>                 Key: KAFKA-2082
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2082
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.8.2.1
>            Reporter: Evan Huus
>            Assignee: Neha Narkhede
>            Priority: Critical
>
> While running integration tests for Sarama (the Go client) we came across a pattern of connection losses that reliably puts Kafka into a bad state: several of the brokers start spinning, chewing ~30% CPU and spamming the logs with hundreds of thousands of lines like:
> {noformat}
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
> {noformat}
> This can be easily and reliably reproduced using the {{toxiproxy-final}} branch of https://github.com/Shopify/sarama which includes a vagrant script for provisioning the appropriate cluster:
> - {{git clone https://github.com/Shopify/sarama.git}}
> - {{git checkout toxiproxy-final}}
> - {{vagrant up}}
> - {{TEST_SEED=1427917826425719059 DEBUG=true go test -v}}
> After the test finishes (it fails because the cluster ends up in a bad state), you can log into the cluster machine with {{vagrant ssh}} and inspect the bad nodes.
> The vagrant script provisions five zookeepers and five brokers in {{/opt/kafka-9091/}} through {{/opt/kafka-9095/}}.
>
> Additional context: the test produces continually to the cluster while randomly cutting and restoring zookeeper connections (all connections to zookeeper are run through a simple proxy on the same vm to make this easy). The majority of the time this works very well and does a good job exercising our producer's retry and failover code. However, under certain patterns of connection loss (the {{TEST_SEED}} in the instructions is important), kafka gets confused. The test never cuts more than two connections at a time, so zookeeper should always have quorum, and the topic (with three replicas) should always be writable.
>
> Completely restarting the cluster via {{vagrant reload}} seems to put it back into a sane state.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
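For reference, a hedged Go sketch of the exercise pattern the report describes: produce continually while a helper cuts and restores zookeeper connections through the local proxy. The proxy-toggle function is hypothetical (a stand-in for the toxiproxy control the real branch uses), and the producer is assumed to be built as in the earlier sketch; only the topic name is taken from the log spam above.

{code:go}
package flaptest

import (
	"log"

	"github.com/Shopify/sarama"
)

// produceWhileFlapping keeps producing while periodically invoking a
// caller-supplied toggle that cuts or restores zookeeper connections.
// toggleZK is a hypothetical stand-in for the real test's proxy control;
// it is not part of sarama.
func produceWhileFlapping(producer sarama.SyncProducer, toggleZK func(up bool)) {
	for i := 0; i < 10000; i++ {
		if i%500 == 0 {
			// The real test flips connections at seeded-random times and
			// never cuts more than two at once, so zookeeper keeps quorum.
			toggleZK(i%1000 != 0)
		}
		_, _, err := producer.SendMessage(&sarama.ProducerMessage{
			Topic: "many_partition",
			Value: sarama.StringEncoder("payload"),
		})
		if err != nil {
			// With three replicas the topic should stay writable; an error
			// here means the producer exhausted its retries.
			log.Printf("produce failed after retries: %v", err)
		}
	}
}
{code}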