Travis Bischel created KAFKA-19233:
--------------------------------------

             Summary: Members cannot rejoin with epoch=0 for KIP-848
                 Key: KAFKA-19233
                 URL: https://issues.apache.org/jira/browse/KAFKA-19233
             Project: Kafka
          Issue Type: Bug
            Reporter: Travis Bischel
         Attachments: logs1

If a group is on generation > 1 and a member is fenced, the member cannot 
rejoin until the broker expires the member from the group.

KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH 
error, the consumer abandon all its partitions and rejoins with the same member 
id and the epoch 0.".
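
As a rough illustration of the client behavior that sentence describes, here is a minimal sketch. The class, field, and method names are hypothetical and are not taken from any real client; only the two error codes (25 for UNKNOWN_MEMBER_ID, 110 for FENCED_MEMBER_EPOCH) come from the Kafka protocol.

```java
import java.util.HashSet;
import java.util.Set;

public final class RejoinSketch {
    // Kafka protocol error codes.
    public static final short UNKNOWN_MEMBER_ID = 25;
    public static final short FENCED_MEMBER_EPOCH = 110;

    /** Hypothetical client-side view of one group member. */
    public static final class MemberState {
        public final String memberId;
        public int memberEpoch;
        public final Set<String> ownedPartitions = new HashSet<>();

        public MemberState(String memberId, int memberEpoch) {
            this.memberId = memberId;
            this.memberEpoch = memberEpoch;
        }
    }

    /**
     * On either fencing error, drop all partitions and reset the epoch to 0.
     * The member id is kept, so the next heartbeat rejoins as the same member.
     * Any other error code leaves the state untouched in this sketch.
     */
    public static void onHeartbeatError(MemberState state, short errorCode) {
        if (errorCode == UNKNOWN_MEMBER_ID || errorCode == FENCED_MEMBER_EPOCH) {
            state.ownedPartitions.clear();
            state.memberEpoch = 0;
        }
    }
}
```

The next heartbeat built from this state carries the original member id with memberEpoch=0, which is the request shape visible in the log line further down.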

However, the current implementation on the broker throws FENCED_MEMBER_EPOCH 
whenever the client-provided epoch differs from the current epoch, unless it is 
exactly the current epoch - 1.

Specifically this line: 
https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535

If the current epoch is 13 and I reset to epoch 0, the conditional always 
throws FENCED_MEMBER_EPOCH.
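
To make the conditional concrete, here is a simplified sketch of the check as I have described it above, next to one possible reading of the KIP-848 sentence. This is not the code in GroupMetadataManager.java: the class, method, and exception names are hypothetical, "previous epoch" is modeled as current epoch - 1, and the relaxed variant is only an assumption about how epoch 0 could be accepted.

```java
public final class EpochCheckSketch {
    /** Hypothetical stand-in for the broker's FENCED_MEMBER_EPOCH error. */
    public static final class FencedMemberEpoch extends RuntimeException {
        public FencedMemberEpoch(String message) {
            super(message);
        }
    }

    /** The check as described in this report: only current or current - 1 passes. */
    public static void checkAsReported(int currentEpoch, int receivedEpoch) {
        if (receivedEpoch != currentEpoch && receivedEpoch != currentEpoch - 1) {
            throw new FencedMemberEpoch(
                "received epoch " + receivedEpoch + ", current epoch " + currentEpoch);
        }
    }

    /** Assumed alternative: epoch 0 from a known member is treated as a rejoin. */
    public static void checkAllowingRejoin(int currentEpoch, int receivedEpoch) {
        if (receivedEpoch == 0) {
            return; // rejoin request, not a stale heartbeat
        }
        checkAsReported(currentEpoch, receivedEpoch);
    }

    /** Helper: runs a check and reports whether it fenced the member. */
    public static boolean isFenced(Runnable check) {
        try {
            check.run();
            return false;
        } catch (FencedMemberEpoch e) {
            return true;
        }
    }
}
```

With a current epoch of 13, checkAsReported fences epoch 0 but accepts 12 and 13, while checkAllowingRejoin lets epoch 0 through.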

Attached are logs of this case. Here is a single log line demonstrating the 
problem:

{code}
2025-05-02 15:23:09,304 [data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG kafka.request.logger - Completed request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The consumer group member has a smaller member epoch (0) than the one known by the group coordinator (11). The member must abandon all its partitions and rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
{code}

The logs show the broker continuously responding with error code 110 for 50s 
until, I'm assuming, some condition boots the member from the group, such that 
the next time the broker receives the request, the member is considered new and 
the request is successful.

The first heartbeat is duplicated; I noticed that Kafka replies with 
FENCED_MEMBER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm 
trying to see if it's possible to work around that. As an aside, between the 
fenced error happening _a lot_, this issue, and KAFKA-19222, I'm leaning toward 
not opting into KIP-848 by default until the broker implementation improves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
