[jira] [Commented] (KAFKA-19233) Members cannot rejoin with epoch=0 for KIP-848

Lianet Magrans (Jira) Mon, 12 May 2025 16:07:31 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951065#comment-17951065
 ]


Lianet Magrans commented on KAFKA-19233:
----------------------------------------

Hey [~twmb], sorry, I've been busy lately but I'm catching up here to help, 
with some answers and questions too. On the flow you describe:

Line 1 : OK

Line 2 : OK, but why a full HB after a successful response from the broker on 
Line 1?

Line 3: Why is the client repeating "CONSUMER_GROUP_HEARTBEAT uxNP e10."? If it 
got FENCED_MEMBER_EPOCH on Line 2 from the broker, the client should release 
its assignment, reset the epoch to 0, and only then HB again to rejoin (so HB 
with 0, not 10)

Line 4: This one is unexpected. Could you double-check why the fence error is 
set on the broker in this case? Maybe point me to the logs (member epoch is 0, 
so I expect the member doesn't exist on the broker because it was 
fenced/removed). Related to this, in the initial description you shared there 
was a case where I didn't see fencing happening, because of this log you shared:
> "The consumer group member has a smaller member epoch (0) than the one known 
> by the group coordinator (11). The member must abandon all its partitions and 
> rejoin."

If the member is "known by the group coordinator (11)", I expect it's because 
it is still in the group with epoch 0 (so not fenced). Are these logs maybe 
mixing several clients?

This test in the Java client covers what a fenced member should do:
[https://github.com/apache/kafka/blob/c28f46459ac26693b3639fca26c0961283c2ee65/clients/src/test/java/org/apache/kafka/clients/consumer/internals/ConsumerHeartbeatRequestManagerTest.java#L839]
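
To make the expected flow concrete, here is a minimal sketch of how a client 
could react to FENCED_MEMBER_EPOCH: release the assignment, reset the epoch to 
0, and only then heartbeat again. The class, field, and method names are 
hypothetical and are not taken from the Java client or from kgo; only the error 
code 110 comes from the broker log in the description.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the actual Kafka client implementation.
final class FencedMemberSketch {
    // Error code shown for FENCED_MEMBER_EPOCH in the broker log above.
    static final short FENCED_MEMBER_EPOCH = 110;
    static final short NONE = 0;

    final String memberId;
    int memberEpoch;
    final Set<String> assignedPartitions = new HashSet<>();

    FencedMemberSketch(String memberId, int memberEpoch) {
        this.memberId = memberId;
        this.memberEpoch = memberEpoch;
    }

    /** Applies a heartbeat response and returns the epoch for the next heartbeat. */
    int onHeartbeatResponse(short errorCode, int responseEpoch) {
        if (errorCode == FENCED_MEMBER_EPOCH) {
            // Release the assignment first, then reset the epoch so that the
            // next heartbeat is a rejoin with the same member id and epoch 0.
            assignedPartitions.clear();
            memberEpoch = 0;
        } else if (errorCode == NONE) {
            // On success, adopt the epoch returned by the coordinator.
            memberEpoch = responseEpoch;
        }
        return memberEpoch;
    }
}
```

With this shape, a member at epoch 10 that gets fenced sends its next heartbeat 
with epoch 0 rather than repeating epoch 10.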
 

Please let me know the setup you are running (how many consumers?), and with 
that and the answers to the questions above we can probably sort this out.

Best!

> Members cannot rejoin with epoch=0 for KIP-848
> ----------------------------------------------
>
>                 Key: KAFKA-19233
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19233
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>            Reporter: Travis Bischel
>            Priority: Major
>         Attachments: logs1
>
>
> If a group is on generation > 1 and a member is fenced, the member cannot 
> rejoin until the broker expires the member from the group.
> KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH 
> error, the consumer abandon all its partitions and rejoins with the same 
> member id and the epoch 0.".
> However, the current implementation on the broker throws FENCED_MEMBER_EPOCH 
> if the client-provided epoch differs from the current epoch and is anything 
> other than the current epoch - 1.
> Specifically this line: 
> [https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535]
> If the current epoch is 13, and I reset to epoch 0, the conditional always 
> throws FENCED_MEMBER_EPOCH.
> Attached are logs of this case, here is a sample of a single log line 
> demonstrating the problem:
> {code:java}
> 2025-05-02 15:23:09,304 
> [data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG 
> kafka.request.logger - Completed 
> request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The
>  consumer group member has a smaller member epoch (0) than the one known by 
> the group coordinator (11). The member must abandon all its partitions and 
> rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
> {code}
> The logs show the broker continuously responding errcode 110 for 50s until, 
> I'm assuming, some condition boots the member from the group, such that the 
> next time the broker receives the request, the member is considered new and 
> the request is successful.
> The first heartbeat is duplicated; I noticed that Kafka replies with 
> FENCED_MEMBER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm 
> trying to see if it's possible to work around that. As an aside, between the 
> fenced error happening {_}a lot{_}, this issue, and KAFKA-19222, I'm leaning 
> to not opt into KIP-848 by default until the broker implementation improves.
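
The epoch check as the issue description characterizes it can be sketched as 
follows. This is only an illustration of the reported behavior, with 
hypothetical names; it is not the code in GroupMetadataManager, which may apply 
further conditions.

```java
// Hypothetical sketch of the check as described in the issue, not broker code.
final class EpochCheckSketch {
    /**
     * Returns true if, per the issue description, the broker would answer
     * FENCED_MEMBER_EPOCH for a member that is still known to the coordinator.
     */
    static boolean wouldFence(int receivedEpoch, int currentEpoch) {
        if (receivedEpoch == currentEpoch) {
            return false; // epochs match, nothing to reject
        }
        // Any other epoch is rejected unless it is exactly currentEpoch - 1.
        return receivedEpoch != currentEpoch - 1;
    }
}
```

Under this reading, wouldFence(0, 13) is true, which matches the reported 
inability to rejoin with epoch 0 while the member is still in the group.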



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
