Travis Bischel created KAFKA-19233:
--------------------------------------

Summary: Members cannot rejoin with epoch=0 for KIP-848
Key: KAFKA-19233
URL: https://issues.apache.org/jira/browse/KAFKA-19233
Project: Kafka
Issue Type: Bug
Reporter: Travis Bischel
Attachments: logs1

If a group is on generation > 1 and a member is fenced, the member cannot rejoin until the broker expires the member from the group.

KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH error, the consumer abandon all its partitions and rejoins with the same member id and the epoch 0." However, the current implementation on the broker throws FENCED_MEMBER_EPOCH if the client-provided epoch is neither the current epoch nor the current epoch - 1. Specifically this line:

https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535

If the current epoch is 13, and I reset to epoch 0, the conditional always throws FENCED_MEMBER_EPOCH.

Attached are logs of this case. Here is a single log line demonstrating the problem:

{code}
2025-05-02 15:23:09,304 [data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG kafka.request.logger - Completed request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The consumer group member has a smaller member epoch (0) than the one known by the group coordinator (11). The member must abandon all its partitions and rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
{code}

The logs show the broker continuously responding with error code 110 for 50s until, I assume, some condition removes the member from the group, so that the next request the broker receives is treated as coming from a new member and succeeds.

The first heartbeat is duplicated; I noticed that Kafka replies with FENCED_MEMBER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm trying to see if it's possible to work around that.

As an aside, between the fenced error happening _a lot_, this issue, and KAFKA-19222, I'm leaning toward not opting into KIP-848 by default until the broker implementation improves.
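
To make the conditional described in this report concrete, here is a toy model of it in Python. It is a sketch under stated assumptions, not the broker's code: the function names are hypothetical, only the epoch comparison is modeled (anything else the real check looks at is ignored), and the second function shows one possible reading of the KIP-848 sentence quoted above rather than an actual or proposed patch.

```python
# Toy model of the member epoch check described in this report.
# Function names are hypothetical; this is not the broker's implementation.

NONE = 0
FENCED_MEMBER_EPOCH = 110  # error code seen in the log line above


def check_as_reported(current_epoch: int, received_epoch: int) -> int:
    """Behavior as described in the report: only the current epoch
    or the current epoch - 1 is accepted."""
    if received_epoch == current_epoch:
        return NONE
    if received_epoch == current_epoch - 1:
        return NONE
    return FENCED_MEMBER_EPOCH


def check_allowing_rejoin(current_epoch: int, received_epoch: int) -> int:
    """Assumed variant: epoch 0 is treated as a rejoin, following the
    KIP-848 text quoted in the report."""
    if received_epoch == 0:
        return NONE
    return check_as_reported(current_epoch, received_epoch)
```

With the numbers from the example in the report, `check_as_reported(13, 0)` returns 110 on every call, which matches the repeated errors in the attached logs, while `check_allowing_rejoin(13, 0)` returns 0.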