[
https://issues.apache.org/jira/browse/KAFKA-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348485#comment-15348485
]
Piotr Trzpil commented on KAFKA-3296:
-------------------------------------
I can reliably reproduce this issue on AWS EC2, running a Kafka 0.10.0.0
cluster with 3 brokers using auto-generated IDs.
The problem occurs if I restart all of the brokers (so the restarted brokers
get new IDs) before reassigning the __consumer_offsets topic partitions.
1. Initial state (obtained using kafkacat):
{noformat}
3 brokers:
broker 1015 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
broker 1016 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
broker 1017 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
topic "__consumer_offsets" with 50 partitions:
partition 23, leader 1016, replicas: 1017,1015,1016, isrs: 1015,1016,1017
partition 41, leader 1016, replicas: 1017,1015,1016, isrs: 1015,1016,1017
partition 32, leader 1016, replicas: 1017,1015,1016, isrs: 1015,1016,1017
partition 8, leader 1016, replicas: 1017,1015,1016, isrs: 1015,1016,1017
{noformat}
2. Then, after restarting all of the brokers, the leaders are not available, as
expected:
{noformat}
3 brokers:
broker 1019 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
broker 1018 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
broker 1020 at ec2-<removed>.eu-west-1.compute.amazonaws.com:9092
...
topic "__consumer_offsets" with 50 partitions:
partition 23, leader -1, replicas: , isrs: , Broker: Leader not available
partition 32, leader -1, replicas: , isrs: , Broker: Leader not available
partition 41, leader -1, replicas: , isrs: , Broker: Leader not available
partition 17, leader -1, replicas: , isrs: , Broker: Leader not available
...
{noformat}
3. However, after generating a reassignment plan for the new brokers and
executing it, the ISR lists remain empty and no leaders are elected:
{noformat}
topic "__consumer_offsets" with 50 partitions:
partition 23, leader -1, replicas: 1018,1019,1020, isrs: , Broker: Leader not available
partition 32, leader -1, replicas: 1018,1019,1020, isrs: , Broker: Leader not available
partition 41, leader -1, replicas: 1018,1019,1020, isrs: , Broker: Leader not available
partition 17, leader -1, replicas: 1018,1019,1020, isrs: , Broker: Leader not available
{noformat}
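For reference, the reassignment plan passed to kafka-reassign-partitions.sh --execute is a JSON file of roughly the following shape (partition list abbreviated here; the broker IDs shown are the new auto-generated ones from step 2):
{noformat}
{"version": 1,
 "partitions": [
   {"topic": "__consumer_offsets", "partition": 23, "replicas": [1018, 1019, 1020]},
   {"topic": "__consumer_offsets", "partition": 32, "replicas": [1018, 1019, 1020]}
 ]}
{noformat}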
4. Only after restarting one of the brokers (chosen at random), keeping its
current ID, is the situation fixed:
{noformat}
topic "__consumer_offsets" with 50 partitions:
partition 23, leader 1020, replicas: 1018,1019,1020, isrs: 1020,1018,1019
partition 32, leader 1020, replicas: 1018,1019,1020, isrs: 1020,1018,1019
partition 41, leader 1020, replicas: 1018,1019,1020, isrs: 1020,1018,1019
partition 17, leader 1020, replicas: 1018,1019,1020, isrs: 1020,1018,1019
{noformat}
However, this issue does not occur if only one broker is restarted with a new
ID before executing the partition reassignment.
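The behaviour in step 3 is consistent with the controller declining to elect a leader because none of the replica IDs in the old assignment is alive any more (setting unclean leader election aside). A minimal sketch of that election constraint, using the broker IDs from above purely for illustration:

```python
# Sketch of the leader-election constraint that appears to bite here:
# a leader can only be chosen from assigned replicas that are currently
# alive (ignoring unclean leader election). Broker IDs are illustrative.

def elect_leader(assigned_replicas, live_brokers):
    """Return the first live assigned replica, or -1 if none is alive."""
    for broker_id in assigned_replicas:
        if broker_id in live_brokers:
            return broker_id
    return -1  # matches the "leader -1" seen in the metadata output

# Before the restart: the assignment still references the old IDs.
old_assignment = [1017, 1015, 1016]
# After the restart, all brokers came back with new auto-generated IDs.
live = {1018, 1019, 1020}

print(elect_leader(old_assignment, live))      # no overlap with live set
print(elect_leader([1018, 1019, 1020], live))  # a working reassignment
```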
> All consumer reads hang indefinitely
> ------------------------------------
>
> Key: KAFKA-3296
> URL: https://issues.apache.org/jira/browse/KAFKA-3296
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.9.0.0, 0.9.0.1
> Reporter: Simon Cooper
> Priority: Critical
> Attachments: controller.zip, kafkalogs.zip
>
>
> We've got several integration tests that bring up systems on VMs for testing.
> We've recently upgraded to 0.9, and very occasionally we see an issue where
> every consumer that tries to read from the broker hangs, spamming the
> following in their logs:
> {code}2016-02-26T12:25:37,856 | DEBUG | o.a.k.c.NetworkClient
> [pool-10-thread-1] | Sending metadata request
> ClientRequest(expectResponse=true, callback=null,
> request=RequestSend(header={api_key=3,api_version=0,correlation_id=21905,client_id=consumer-1},
> body={topics=[Topic1]}), isInitiatedByNetworkClient,
> createdTimeMs=1456489537856, sendTimeMs=0) to node 1
> 2016-02-26T12:25:37,856 | DEBUG | o.a.k.c.Metadata [pool-10-thread-1] |
> Updated cluster metadata version 10954 to Cluster(nodes = [Node(1,
> server.internal, 9092)], partitions = [Partition(topic = Topic1, partition =
> 0, leader = 1, replicas = [1,], isr = [1,]])
> 2016-02-26T12:25:37,856 | DEBUG | o.a.k.c.c.i.AbstractCoordinator
> [pool-10-thread-1] | Issuing group metadata request to broker 1
> 2016-02-26T12:25:37,857 | DEBUG | o.a.k.c.c.i.AbstractCoordinator
> [pool-10-thread-1] | Group metadata response
> ClientResponse(receivedTimeMs=1456489537857, disconnected=false,
> request=ClientRequest(expectResponse=true,
> callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@28edb273,
> request=RequestSend(header={api_key=10,api_version=0,correlation_id=21906,client_id=consumer-1},
> body={group_id=}), createdTimeMs=1456489537856, sendTimeMs=1456489537856),
> responseBody={error_code=15,coordinator={node_id=-1,host=,port=-1}})
> 2016-02-26T12:25:37,956 | DEBUG | o.a.k.c.NetworkClient [pool-10-thread-1] |
> Sending metadata request ClientRequest(expectResponse=true, callback=null,
> request=RequestSend(header={api_key=3,api_version=0,correlation_id=21907,client_id=consumer-1},
> body={topics=[Topic1]}), isInitiatedByNetworkClient,
> createdTimeMs=1456489537956, sendTimeMs=0) to node 1
> 2016-02-26T12:25:37,956 | DEBUG | o.a.k.c.Metadata [pool-10-thread-1] |
> Updated cluster metadata version 10955 to Cluster(nodes = [Node(1,
> server.internal, 9092)], partitions = [Partition(topic = Topic1, partition =
> 0, leader = 1, replicas = [1,], isr = [1,]])
> 2016-02-26T12:25:37,956 | DEBUG | o.a.k.c.c.i.AbstractCoordinator
> [pool-10-thread-1] | Issuing group metadata request to broker 1
> 2016-02-26T12:25:37,957 | DEBUG | o.a.k.c.c.i.AbstractCoordinator
> [pool-10-thread-1] | Group metadata response
> ClientResponse(receivedTimeMs=1456489537957, disconnected=false,
> request=ClientRequest(expectResponse=true,
> callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@40cee8cc,
> request=RequestSend(header={api_key=10,api_version=0,correlation_id=21908,client_id=consumer-1},
> body={group_id=}), createdTimeMs=1456489537956, sendTimeMs=1456489537956),
> responseBody={error_code=15,coordinator={node_id=-1,host=,port=-1}})
> 2016-02-26T12:25:38,056 | DEBUG | o.a.k.c.NetworkClient [pool-10-thread-1] |
> Sending metadata request ClientRequest(expectResponse=true, callback=null,
> request=RequestSend(header={api_key=3,api_version=0,correlation_id=21909,client_id=consumer-1},
> body={topics=[Topic1]}), isInitiatedByNetworkClient,
> createdTimeMs=1456489538056, sendTimeMs=0) to node 1
> 2016-02-26T12:25:38,056 | DEBUG | o.a.k.c.Metadata [pool-10-thread-1] |
> Updated cluster metadata version 10956 to Cluster(nodes = [Node(1,
> server.internal, 9092)], partitions = [Partition(topic = Topic1, partition =
> 0, leader = 1, replicas = [1,], isr = [1,]])
> 2016-02-26T12:25:38,056 | DEBUG | o.a.k.c.c.i.AbstractCoordinator
> [pool-10-thread-1] | Issuing group metadata request to broker 1
> 2016-02-26T12:25:38,057 | DEBUG | o.a.k.c.c.i.AbstractCoordinator
> [pool-10-thread-1] | Group metadata response
> ClientResponse(receivedTimeMs=1456489538057, disconnected=false,
> request=ClientRequest(expectResponse=true,
> callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@439e25fb,
> request=RequestSend(header={api_key=10,api_version=0,correlation_id=21910,client_id=consumer-1},
> body={group_id=}), createdTimeMs=1456489538056, sendTimeMs=1456489538056),
> responseBody={error_code=15,coordinator={node_id=-1,host=,port=-1}}){code}
> This persists for any 0.9 consumer trying to read from the topic (we haven't
> confirmed whether this affects a single topic or every topic on the broker).
> 0.8 consumers can read from the broker without issues. The problem is fixed
> by a broker restart.
> This was observed on a single-broker cluster. There were no suspicious log
> messages on the server.
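The group metadata responses in the log loop above all end with error_code=15, which in the Kafka protocol is GROUP_COORDINATOR_NOT_AVAILABLE (later renamed COORDINATOR_NOT_AVAILABLE). It is classified as retriable, which is why the client keeps cycling through metadata and coordinator requests instead of failing. A minimal decoder for the codes relevant here (a hand-written subset, not the full protocol table):

```python
# Hand-written subset of the Kafka protocol error-code table, covering
# the code seen in the log above. error_code=15 is retriable, so the
# consumer loops on coordinator discovery rather than giving up.

ERROR_CODES = {
    0: ("NONE", False),
    15: ("GROUP_COORDINATOR_NOT_AVAILABLE", True),  # retriable
    16: ("NOT_COORDINATOR_FOR_GROUP", True),        # retriable
}

def describe(error_code):
    """Map a numeric error code to its name and retriability."""
    name, retriable = ERROR_CODES.get(error_code, ("UNKNOWN", False))
    return f"{name} (retriable={retriable})"

print(describe(15))  # the error the consumer keeps receiving
```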
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)