Travis Bischel created KAFKA-19235:
--------------------------------------

Summary: STALE_MEMBER_EPOCH is mostly non-recoverable and forces lost commits when leaving a group (KIP-848)
Key: KAFKA-19235
URL: https://issues.apache.org/jira/browse/KAFKA-19235
Project: Kafka
Issue Type: Bug
Components: clients, consumer
Affects Versions: 4.0.0
Reporter: Travis Bischel

Flow:
* I heartbeat and receive memberEpoch 7, with a heartbeat interval of 5s
* 3s later I want to leave the group
* In my OnRevoke, before leaving, I commit offsets
* The broker has bumped the memberEpoch in the meantime
* My OffsetCommit request fails with STALE_MEMBER_EPOCH

I am leaving the group, so there will be no future heartbeat (besides the one that actually leaves the group with memberEpoch -1 or -2) from which to get a new epoch so that I can issue a final commit.

What I've tried locally is to force an inline ConsumerGroupHeartbeat when I receive STALE_MEMBER_EPOCH from an OffsetCommit response, and then reissue the commit request. However, Kafka 4 returns FENCED_MEMBER_EPOCH _a lot_, and this forced ConsumerGroupHeartbeat frequently receives FENCED_MEMBER_EPOCH, so I cannot update the epoch.

Clients are meant to give up all partitions when they experience FENCED_MEMBER_EPOCH and rejoin with a MemberEpoch of 0. But we're already in the process of giving up partitions. The commit just can't go through.

The Java client appears to blindly retry the commit without doing anything with the epoch. Likely the epoch is handled elsewhere, and unless something shows me otherwise, the Java client should also be experiencing the FENCED_MEMBER_EPOCH problem if that is the case:

[https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/main/java/org/apache/kafka/clients/consumer/internals/CommitRequestManager.java#L346-L352]

There are some tests in the Java client codebase, but they do not actually test whether the commit succeeds. The tests simply check that the commit is scheduled to be retried:

[https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/test/java/org/apache/kafka/clients/consumer/internals/CommitRequestManagerTest.java#L481-L485]
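
To make the failure mode in the flow above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: `ToyCoordinator`, `commit_before_leave`, and the `fence_on_heartbeat` flag are hypothetical names invented for illustration, and the coordinator logic is a deliberate simplification, not Kafka's actual broker behavior. It only shows why the "force a heartbeat, then retry the commit" workaround cannot succeed once the forced heartbeat is fenced.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Err(Enum):
    NONE = "NONE"
    STALE_MEMBER_EPOCH = "STALE_MEMBER_EPOCH"
    FENCED_MEMBER_EPOCH = "FENCED_MEMBER_EPOCH"


@dataclass
class ToyCoordinator:
    """Hypothetical stand-in for the group coordinator (not Kafka's real logic)."""
    member_epoch: int         # epoch the coordinator currently holds for the member
    fence_on_heartbeat: bool  # assumption: models the FENCED_MEMBER_EPOCH case

    def offset_commit(self, epoch: int) -> Err:
        # Simplification: any epoch other than the current one is rejected.
        if epoch == self.member_epoch:
            return Err.NONE
        return Err.STALE_MEMBER_EPOCH

    def heartbeat(self, epoch: int) -> Tuple[Err, int]:
        # When fenced, the client learns no new epoch and keeps its old one.
        if self.fence_on_heartbeat:
            return Err.FENCED_MEMBER_EPOCH, epoch
        return Err.NONE, self.member_epoch


def commit_before_leave(coord: ToyCoordinator, client_epoch: int) -> Tuple[bool, List[str]]:
    """Try the final commit; on STALE_MEMBER_EPOCH force one heartbeat and retry once."""
    trace = []
    err = coord.offset_commit(client_epoch)
    trace.append(f"commit(epoch={client_epoch}) -> {err.value}")
    if err is Err.STALE_MEMBER_EPOCH:
        hb_err, new_epoch = coord.heartbeat(client_epoch)
        trace.append(f"heartbeat(epoch={client_epoch}) -> {hb_err.value}")
        if hb_err is Err.NONE:
            # Only a successful heartbeat yields an epoch worth retrying with.
            err = coord.offset_commit(new_epoch)
            trace.append(f"commit(epoch={new_epoch}) -> {err.value}")
    return err is Err.NONE, trace


if __name__ == "__main__":
    # Client holds epoch 7, the coordinator has already moved to 8.
    for fenced in (False, True):
        ok, trace = commit_before_leave(ToyCoordinator(8, fenced), 7)
        print(f"fence_on_heartbeat={fenced}: committed={ok}")
        for step in trace:
            print("  " + step)
```

In this model, the retry succeeds only when the forced heartbeat returns a fresh epoch. With `fence_on_heartbeat=True` the client never learns the new epoch, so the final commit is lost, which is the situation described in this report.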