[ https://issues.apache.org/jira/browse/KAFKA-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004110#comment-15004110 ]

Ralph Tice edited comment on KAFKA-2082 at 11/13/15 9:12 PM:
-------------------------------------------------------------

I just ran into this error on 0.8.2.1.
Is this issue still being worked on?

Edit: sorry, the previous version of this comment described a problem caused by a 
hostname conflict that led to cross-talk between clusters.


was (Author: ralph.tice):
I just ran into this error on 0.8.2.1. One other symptom I noticed: when I stop 
the 2 healthy brokers that aren't spinning CPU on replication, the first broker 
seems to recover, since it has become the leader for the partitions it was trying 
to fetch. However, while the 2 healthy brokers are stopped, for topics with 12 
partitions every 3rd partition showed up in the ISR, like so:
{code}
Topic:mytopic   PartitionCount:12       ReplicationFactor:3     Configs:
        Topic: mytopic  Partition: 0    Leader: 0       Replicas: 0,2,1 Isr: 0
        Topic: mytopic  Partition: 1    Leader: 0       Replicas: 1,0,2 Isr: 0,2,1
        Topic: mytopic  Partition: 2    Leader: 0       Replicas: 2,1,0 Isr: 0
        Topic: mytopic  Partition: 3    Leader: 0       Replicas: 0,1,2 Isr: 0
        Topic: mytopic  Partition: 4    Leader: 0       Replicas: 1,2,0 Isr: 0,2,1
        Topic: mytopic  Partition: 5    Leader: 0       Replicas: 2,0,1 Isr: 0
        Topic: mytopic  Partition: 6    Leader: 0       Replicas: 0,2,1 Isr: 0
        Topic: mytopic  Partition: 7    Leader: 0       Replicas: 1,0,2 Isr: 0,2,1
        Topic: mytopic  Partition: 8    Leader: 0       Replicas: 2,1,0 Isr: 0
        Topic: mytopic  Partition: 9    Leader: 0       Replicas: 0,1,2 Isr: 0
        Topic: mytopic  Partition: 10   Leader: 0       Replicas: 1,2,0 Isr: 0,2,1
        Topic: mytopic  Partition: 11   Leader: 0       Replicas: 2,0,1 Isr: 0
{code}
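
For reference, the describe output above came from the stock topics tool; a minimal invocation (assuming a local ZooKeeper on port 2181, so adjust the connect string for your cluster) looks like:
{code}
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic mytopic
{code}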

Is this issue still being worked on?

> Kafka Replication ends up in a bad state
> ----------------------------------------
>
>                 Key: KAFKA-2082
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2082
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.8.2.1
>            Reporter: Evan Huus
>            Assignee: Sriharsha Chintalapani
>            Priority: Critical
>              Labels: zkclient-problems
>         Attachments: KAFKA-2082.patch
>
>
> While running integration tests for Sarama (the Go client) we came across a 
> pattern of connection losses that reliably puts Kafka into a bad state: 
> several of the brokers start spinning, chewing ~30% CPU and spamming the logs 
> with hundreds of thousands of lines like:
> {noformat}
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111094 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,070] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,1] failed due to Leader not local for partition [many_partition,1] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,071] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111095 from client ReplicaFetcherThread-0-9093 on partition [many_partition,6] failed due to Leader not local for partition [many_partition,6] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,21] failed due to Leader not local for partition [many_partition,21] on broker 9093 (kafka.server.ReplicaManager)
> [2015-04-01 13:08:40,072] WARN [Replica Manager on Broker 9093]: Fetch request with correlation id 111096 from client ReplicaFetcherThread-0-9093 on partition [many_partition,26] failed due to Leader not local for partition [many_partition,26] on broker 9093 (kafka.server.ReplicaManager)
> {noformat}
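> One way to cross-check what these warnings claim is to compare the leader recorded in ZooKeeper against the broker the fetcher keeps asking. A rough sketch (the connect string and partition number are placeholders for your setup):
> {code}
> bin/zookeeper-shell.sh localhost:2181 get /brokers/topics/many_partition/partitions/1/state
> # the JSON in this znode includes "leader" and "isr"; if it names a broker other
> # than the one being asked, the spinning fetcher is working from stale metadata
> {code}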
> This can be easily and reliably reproduced using the {{toxiproxy-final}} 
> branch of https://github.com/Shopify/sarama, which includes a Vagrant script 
> for provisioning the appropriate cluster: 
> - {{git clone https://github.com/Shopify/sarama.git}}
> - {{git checkout test-jira-kafka-2082}}
> - {{vagrant up}}
> - {{TEST_SEED=1427917826425719059 DEBUG=true go test -v}}
> After the test finishes (it fails because the cluster ends up in a bad 
> state), you can log into the cluster machine with {{vagrant ssh}} and inspect 
> the bad nodes. The Vagrant script provisions five ZooKeepers and five brokers 
> in {{/opt/kafka-9091/}} through {{/opt/kafka-9095/}}.
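> When inspecting a bad node, one quick check is to count the flood of warnings in the spinning broker's log. A sketch only: the server.log path is an assumption based on the standard tarball layout, so adjust it to wherever your brokers actually write their logs:
> {code}
> grep -c "Leader not local" /opt/kafka-9093/logs/server.log
> {code}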
> Additional context: the test continually produces to the cluster while 
> randomly cutting and restoring ZooKeeper connections (all connections to 
> ZooKeeper are run through a simple proxy on the same VM to make this easy). 
> The majority of the time this works very well and does a good job exercising 
> our producer's retry and failover code. However, under certain patterns of 
> connection loss (the {{TEST_SEED}} in the instructions is important), Kafka 
> gets confused. The test never cuts more than two connections at a time, so 
> ZooKeeper should always have quorum, and the topic (with three replicas) 
> should always be writable.
> Completely restarting the cluster via {{vagrant reload}} seems to put it back 
> into a sane state.



