David Jacot created KAFKA-20634:
-----------------------------------

             Summary: Spurious HighWatermarkUpdate failed errors in the group 
coordinator after partition leadership change
                 Key: KAFKA-20634
                 URL: https://issues.apache.org/jira/browse/KAFKA-20634
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 4.3.0, 4.2.0, 4.1.0, 4.0.0
            Reporter: David Jacot
            Assignee: David Jacot


During routine __consumer_offsets partition leadership changes, the group 
coordinator emits bursts of ERROR-level log messages such as:

{noformat}
[GroupCoordinator id=N] Execution of HighWatermarkUpdate failed due to New 
committed offset X of __consumer_offsets-N must be less than or equal to Y.
[GroupCoordinator id=N] Execution of HighWatermarkUpdate failed due to No 
in-memory snapshot for epoch X. Snapshot epochs are: Y.
{noformat}

These errors appear on the group coordinator that lost leadership of a 
__consumer_offsets partition, and they last a few seconds. The exceptions are 
caught inside CoordinatorInternalEvent and do not propagate to clients, but 
they create unnecessary and confusing log noise.

Root cause: when a partition transitions to follower, the local log is 
truncated and then replicates from the new leader, which advances the high 
watermark (HWM). The group coordinator for that partition stays ACTIVE until 
scheduleUnloadOperation runs asynchronously. During that window, the HWM 
listener fires with offsets that do not match the coordinator's write 
boundaries. This violates invariants in 
SnapshottableCoordinator.updateLastCommittedOffset and in 
SnapshotRegistry.getSnapshot, resulting in IllegalStateExceptions.
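
The race described above can be illustrated with a small, self-contained 
model. This is only a sketch under stated assumptions, not the actual Kafka 
code: the class StaleHwmSketch, CoordinatorModel, and the methods write, 
onHighWatermarkUpdated, and tryUpdate are hypothetical names, and the two 
checks are a simplified guess at the invariants mentioned above, with messages 
modeled on the log lines quoted earlier.

```java
import java.util.TreeMap;

public class StaleHwmSketch {

    /** Hypothetical, simplified model of a coordinator's offset bookkeeping. */
    static class CoordinatorModel {
        private long lastWrittenOffset = 0L;
        private long lastCommittedOffset = 0L;
        // One in-memory snapshot per write boundary, keyed by offset ("epoch").
        private final TreeMap<Long, String> snapshots = new TreeMap<>();

        CoordinatorModel() {
            snapshots.put(0L, "state@0");
        }

        /** Records a coordinator write that ends at the given offset. */
        void write(long endOffset) {
            lastWrittenOffset = endOffset;
            snapshots.put(endOffset, "state@" + endOffset);
        }

        long lastCommittedOffset() {
            return lastCommittedOffset;
        }

        /** Called when the partition's high watermark moves (assumed listener shape). */
        void onHighWatermarkUpdated(long offset) {
            // Assumed invariant 1: the HWM cannot pass what this coordinator wrote.
            if (offset > lastWrittenOffset) {
                throw new IllegalStateException("New committed offset " + offset
                    + " must be less than or equal to " + lastWrittenOffset + ".");
            }
            // Assumed invariant 2: the HWM must land on a known write boundary.
            if (!snapshots.containsKey(offset)) {
                throw new IllegalStateException("No in-memory snapshot for epoch "
                    + offset + ". Snapshot epochs are: " + snapshots.keySet() + ".");
            }
            lastCommittedOffset = offset;
            // Snapshots older than the committed offset are no longer needed.
            snapshots.headMap(offset, false).clear();
        }
    }

    /** Applies an HWM update and returns the error message, or null on success. */
    static String tryUpdate(CoordinatorModel model, long offset) {
        try {
            model.onHighWatermarkUpdated(offset);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        CoordinatorModel model = new CoordinatorModel();
        model.write(10);
        model.write(20);

        // While leader: the HWM advances to one of the coordinator's own boundaries.
        System.out.println(tryUpdate(model, 10));

        // After becoming follower but before unload: the HWM now follows the new
        // leader's log, so it can pass the last write or fall between boundaries.
        System.out.println(tryUpdate(model, 35));
        System.out.println(tryUpdate(model, 15));
    }
}
```

In this model the first update succeeds, while the two updates that arrive 
after the simulated leadership change fail, one on each check. The failures 
leave the model's committed offset untouched, which matches the observation 
that the errors are noisy rather than harmful to clients.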



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
