Nanda Kishore M S created KAFKA-13044:
------------------------------------------
Summary: __consumer_offsets corruption
Key: KAFKA-13044
URL: https://issues.apache.org/jira/browse/KAFKA-13044
Project: Kafka
Issue Type: Bug
Components: offset manager
Affects Versions: 2.5.0
Environment: Amazon Linux
Kafka Server: 2.5.0, scala version - 2.12
Reporter: Nanda Kishore M S

We hit an issue where clients were unable to discover a group coordinator. When we tried to read data from a topic via kafka-console-consumer, the client logs showed the following:

{noformat}
[2021-07-06 08:15:14,499] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Sending FindCoordinator request to broker kafka01-broker:9094 (id: 5 rack: us-west-2b) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Received FindCoordinator response ClientResponse(receivedTimeMs=1625559314504, latencyMs=5, disconnected=false, requestHeader=RequestHeader(apiKey=FIND_COORDINATOR, apiVersion=3, clientId=consumer-test-consumer-group-1-1, correlationId=32), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=0, errorMessage='NONE', nodeId=3, host='kafka02-broker', port=9094)) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Discovered group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Sending FindCoordinator request to broker kafka01-broker:9094 (id: 5 rack: us-west-2b) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Received FindCoordinator response ClientResponse(receivedTimeMs=1625559314504, latencyMs=5, disconnected=false, requestHeader=RequestHeader(apiKey=FIND_COORDINATOR, apiVersion=3, clientId=consumer-test-consumer-group-1-1, correlationId=32), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=0, errorMessage='NONE', nodeId=3, host='kafka02-broker', port=9094)) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Discovered group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
{noformat}

...and so on, endlessly.
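One note on the log lines above: the {{id: 2147483644}} is not itself a sign of corruption. As far as we can tell, the consumer derives a synthetic connection id for the coordinator by mirroring the broker id below {{Integer.MAX_VALUE}} (2147483647 - 3 = 2147483644), so that the coordinator connection stays separate from the regular connection to the same broker. In other words, FindCoordinator succeeds and points at broker 3 (kafka02-broker), but the connection to that coordinator is then judged unavailable and the client starts over.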
We had a look at the __consumer_offsets topic, and the data looks wrong for 7 partitions (1, 4, 18, 21, 31, 34, and 48, marked with a leading * below), where the ISR set and the replica set are mutually exclusive. Note that for these partitions the leader is not in the replica list either.

{noformat}
./kafka-topics.sh --describe --zookeeper localhost:2181 --topic __consumer_offsets
Topic: __consumer_offsets  PartitionCount: 50  ReplicationFactor: 3  Configs: compression.type=producer,cleanup.policy=compact,segment.bytes=104857600
  Topic: __consumer_offsets  Partition: 0   Leader: 5  Replicas: 5,3,4  Isr: 4,5,3
* Topic: __consumer_offsets  Partition: 1   Leader: 3  Replicas: 6,4,5  Isr: 3,2
  Topic: __consumer_offsets  Partition: 2   Leader: 1  Replicas: 1,5,6  Isr: 1,5,6
  Topic: __consumer_offsets  Partition: 3   Leader: 2  Replicas: 2,6,1  Isr: 1,2,6
* Topic: __consumer_offsets  Partition: 4   Leader: 6  Replicas: 3,1,2  Isr: 6,5
  Topic: __consumer_offsets  Partition: 5   Leader: 4  Replicas: 4,2,3  Isr: 4,2,3
  Topic: __consumer_offsets  Partition: 6   Leader: 5  Replicas: 5,6,1  Isr: 1,5,6
  Topic: __consumer_offsets  Partition: 7   Leader: 6  Replicas: 6,1,2  Isr: 1,2,6
  Topic: __consumer_offsets  Partition: 8   Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Topic: __consumer_offsets  Partition: 9   Leader: 2  Replicas: 2,3,4  Isr: 4,2,3
  Topic: __consumer_offsets  Partition: 10  Leader: 3  Replicas: 3,4,5  Isr: 4,5,3
  Topic: __consumer_offsets  Partition: 11  Leader: 4  Replicas: 4,5,6  Isr: 4,5,6
  Topic: __consumer_offsets  Partition: 12  Leader: 5  Replicas: 5,3,4  Isr: 4,5,3
  Topic: __consumer_offsets  Partition: 13  Leader: 6  Replicas: 6,4,5  Isr: 4,5,6
  Topic: __consumer_offsets  Partition: 14  Leader: 1  Replicas: 1,5,6  Isr: 1,5,6
  Topic: __consumer_offsets  Partition: 15  Leader: 2  Replicas: 2,6,1  Isr: 1,2,6
  Topic: __consumer_offsets  Partition: 16  Leader: 3  Replicas: 3,1,2  Isr: 1,2,3
  Topic: __consumer_offsets  Partition: 17  Leader: 4  Replicas: 4,2,3  Isr: 4,2,3
* Topic: __consumer_offsets  Partition: 18  Leader: 2  Replicas: 5,1,3  Isr: 2,6
  Topic: __consumer_offsets  Partition: 19  Leader: 6  Replicas: 6,2,4  Isr: 4,2,6
  Topic: __consumer_offsets  Partition: 20  Leader: 1  Replicas: 1,3,5  Isr: 1,5,3
* Topic: __consumer_offsets  Partition: 21  Leader: 5  Replicas: 2,4,6  Isr: 5,3
  Topic: __consumer_offsets  Partition: 22  Leader: 3  Replicas: 3,5,1  Isr: 1,5,3
  Topic: __consumer_offsets  Partition: 23  Leader: 4  Replicas: 4,6,2  Isr: 4,2,6
  Topic: __consumer_offsets  Partition: 24  Leader: 5  Replicas: 5,4,6  Isr: 4,5,6
  Topic: __consumer_offsets  Partition: 25  Leader: 6  Replicas: 6,5,1  Isr: 1,5,6
  Topic: __consumer_offsets  Partition: 26  Leader: 1  Replicas: 1,6,2  Isr: 1,2,6
  Topic: __consumer_offsets  Partition: 27  Leader: 2  Replicas: 2,1,3  Isr: 1,2,3
  Topic: __consumer_offsets  Partition: 28  Leader: 3  Replicas: 3,2,4  Isr: 4,2,3
  Topic: __consumer_offsets  Partition: 29  Leader: 4  Replicas: 4,3,5  Isr: 4,5,3
  Topic: __consumer_offsets  Partition: 30  Leader: 5  Replicas: 5,3,4  Isr: 4,5,3
* Topic: __consumer_offsets  Partition: 31  Leader: 3  Replicas: 6,4,5  Isr: 3,2
  Topic: __consumer_offsets  Partition: 32  Leader: 1  Replicas: 1,5,6  Isr: 1,5,6
  Topic: __consumer_offsets  Partition: 33  Leader: 2  Replicas: 2,6,1  Isr: 1,2,6
* Topic: __consumer_offsets  Partition: 34  Leader: 6  Replicas: 3,1,2  Isr: 6,5
  Topic: __consumer_offsets  Partition: 35  Leader: 4  Replicas: 4,2,3  Isr: 4,2,3
  Topic: __consumer_offsets  Partition: 36  Leader: 5  Replicas: 5,6,1  Isr: 1,5,6
  Topic: __consumer_offsets  Partition: 37  Leader: 6  Replicas: 6,1,2  Isr: 1,2,6
  Topic: __consumer_offsets  Partition: 38  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Topic: __consumer_offsets  Partition: 39  Leader: 2  Replicas: 2,3,4  Isr: 4,2,3
  Topic: __consumer_offsets  Partition: 40  Leader: 3  Replicas: 3,4,5  Isr: 4,5,3
  Topic: __consumer_offsets  Partition: 41  Leader: 4  Replicas: 4,5,6  Isr: 4,5,6
  Topic: __consumer_offsets  Partition: 42  Leader: 5  Replicas: 5,3,4  Isr: 4,5,3
  Topic: __consumer_offsets  Partition: 43  Leader: 6  Replicas: 6,4,5  Isr: 4,5,6
  Topic: __consumer_offsets  Partition: 44  Leader: 1  Replicas: 1,5,6  Isr: 1,5,6
  Topic: __consumer_offsets  Partition: 45  Leader: 2  Replicas: 2,6,1  Isr: 1,2,6
  Topic: __consumer_offsets  Partition: 46  Leader: 3  Replicas: 3,1,2  Isr: 1,2,3
  Topic: __consumer_offsets  Partition: 47  Leader: 4  Replicas: 4,2,3  Isr: 4,2,3
* Topic: __consumer_offsets  Partition: 48  Leader: 2  Replicas: 5,1,3  Isr: 2,6
  Topic: __consumer_offsets  Partition: 49  Leader: 6  Replicas: 6,2,4  Isr: 4,2,6
{noformat}
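To enumerate the affected partitions mechanically, something along these lines works (a sketch that assumes the exact field layout of the describe output above): it prints every partition whose Replicas and Isr columns share no broker id.

{code:bash}
# Flag partitions whose replica list and ISR are disjoint.
# Assumes the whitespace-separated field layout shown above:
#   $3 = "Partition:", $8 = replica list, $10 = ISR list.
./kafka-topics.sh --describe --zookeeper localhost:2181 --topic __consumer_offsets |
  awk '$3 == "Partition:" {
    nr = split($8, r, ",")
    ni = split($10, i, ",")
    overlap = 0
    for (a = 1; a <= nr; a++)
      for (b = 1; b <= ni; b++)
        if (r[a] == i[b]) overlap = 1
    if (!overlap) print "disjoint replicas/ISR ->", $0
  }'
{code}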
Looking at the source code, in {{AbstractCoordinator.java}}, {{client.isUnavailable(coordinator)}} seems to return {{true}} every time, hence the endless rediscovery loop:

{code:java}
protected synchronized boolean ensureCoordinatorReady(final Timer timer) {
    ...
    } else if (coordinator != null && client.isUnavailable(coordinator)) {
        // we found the coordinator, but the connection has failed, so mark
        // it dead and backoff before retrying discovery
        markCoordinatorUnknown();
        timer.sleep(rebalanceConfig.retryBackoffMs);
    }
{code}

We are wondering what could have caused this corruption: the brokers have been running for the past 54 days, and we have not done any upgrade recently.

We were able to work around the problem by re-assigning the marked partitions with kafka-reassign-partitions.sh, replacing each partition's replica list with its ISR list, as sketched below.
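For reference, the reassignment looked roughly like this (a sketch covering two of the seven partitions; {{reassign.json}} is a file name of our choosing, and the remaining partitions follow the same pattern, each taking its own ISR as the new replica list):

{code:bash}
# Build the reassignment plan: new replica list = current ISR.
cat > reassign.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "__consumer_offsets", "partition": 1, "replicas": [3, 2] },
    { "topic": "__consumer_offsets", "partition": 4, "replicas": [6, 5] }
  ]
}
EOF

# Kick off the reassignment, then poll --verify until it reports completion.
./kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassign.json --execute
./kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassign.json --verify
{code}

Note that this leaves the affected partitions with only two replicas (the size of their ISR); a follow-up reassignment can add a third replica back once the topic is healthy again.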