Travis Bischel created KAFKA-19235:
--------------------------------------

             Summary: STALE_MEMBER_EPOCH is mostly non-recoverable and forces 
lost commits when leaving a group (KIP-848)
                 Key: KAFKA-19235
                 URL: https://issues.apache.org/jira/browse/KAFKA-19235
             Project: Kafka
          Issue Type: Bug
          Components: clients, consumer
    Affects Versions: 4.0.0
            Reporter: Travis Bischel


Flow:
* I heartbeat and receive memberEpoch 7, with a heartbeat interval of 5s
* 3s later, I want to leave the group
* In my OnRevoke, before leaving, I commit offsets
* Meanwhile, the broker has bumped the memberEpoch
* My OffsetCommit request fails with STALE_MEMBER_EPOCH

Because I am leaving the group, there will be no future heartbeat (besides the 
one that actually leaves the group, with memberEpoch -1 or -2) from which I 
could get a new epoch and issue a final commit.

What I've tried locally is to force an inline ConsumerGroupHeartbeat whenever I 
receive STALE_MEMBER_EPOCH in an OffsetCommit response, and then reissue the 
commit request. However, Kafka 4 returns FENCED_MEMBER_EPOCH _a lot_, and this 
forced ConsumerGroupHeartbeat frequently receives FENCED_MEMBER_EPOCH itself, 
so I cannot update the epoch.

Clients are meant to give up all partitions when they receive 
FENCED_MEMBER_EPOCH and rejoin with a MemberEpoch of 0. But we're already in 
the process of giving up partitions, and the final commit just can't go through.

The Java client appears to blindly retry the commit without doing anything 
with the epoch. The epoch is likely handled elsewhere; if so, unless something 
shows me otherwise, the Java client should also be hitting the 
FENCED_MEMBER_EPOCH problem:

[https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/main/java/org/apache/kafka/clients/consumer/internals/CommitRequestManager.java#L346-L352]

There are some tests in the Java client codebase, but they do not actually test 
whether the commit succeeds. They only check that the commit is scheduled to be 
retried:

[https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/test/java/org/apache/kafka/clients/consumer/internals/CommitRequestManagerTest.java#L481-L485]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
