David Jacot created KAFKA-20634:
-----------------------------------
Summary: Spurious HighWatermarkUpdate failed errors in the group
coordinator after partition leadership change
Key: KAFKA-20634
URL: https://issues.apache.org/jira/browse/KAFKA-20634
Project: Kafka
Issue Type: Bug
Affects Versions: 4.3.0, 4.2.0, 4.1.0, 4.0.0
Reporter: David Jacot
Assignee: David Jacot
During routine __consumer_offsets partition leadership changes, the group
coordinator floods the log with ERROR-level messages such as:
{noformat}
[GroupCoordinator id=N] Execution of HighWatermarkUpdate failed due to New
committed offset X of __consumer_offsets-N must be less than or equal to Y.
[GroupCoordinator id=N] Execution of HighWatermarkUpdate failed due to No
in-memory snapshot for epoch X. Snapshot epochs are: Y.
{noformat}
These errors appear on the group coordinator that lost leadership of a
__consumer_offsets partition and persist for a few seconds. The exceptions are
caught inside CoordinatorInternalEvent and do not propagate to clients, but they
create unnecessary and confusing log noise.
Root cause: when a partition transitions to follower, the local log is
truncated and then replicates from the new leader, which advances the high
watermark (HWM). The group coordinator stays ACTIVE until
scheduleUnloadOperation runs asynchronously. In that window, the HWM listener
fires with offsets that don't match the coordinator's write boundaries. This
violates invariants in SnapshottableCoordinator.updateLastCommittedOffset and
in SnapshotRegistry.getSnapshot, both of which then throw
IllegalStateException.
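
The two failure modes can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kafka code: HwmRaceSketch, write, resign and onHighWatermarkUpdate are hypothetical names, the exception messages are merely modeled on the log lines above, and the "active" check is one possible guard rather than the fix adopted for this issue.

```java
import java.util.TreeSet;

// Toy model only: none of the names below are real Kafka classes or methods.
final class HwmRaceSketch {
    // Offsets at which the coordinator holds an in-memory snapshot.
    private final TreeSet<Long> snapshotEpochs = new TreeSet<>();
    private long lastWrittenOffset = 0L;
    private long lastCommittedOffset = 0L;
    private boolean active = true;

    HwmRaceSketch() {
        snapshotEpochs.add(0L);
    }

    // The coordinator appends records and snapshots its state at the new end offset.
    void write(long endOffset) {
        lastWrittenOffset = endOffset;
        snapshotEpochs.add(endOffset);
    }

    // Mirrors the two invariants described above: the committed offset may not
    // exceed what the coordinator wrote, and it must match a snapshot epoch.
    void updateLastCommittedOffset(long offset) {
        if (offset > lastWrittenOffset) {
            throw new IllegalStateException("New committed offset " + offset
                + " must be less than or equal to " + lastWrittenOffset + ".");
        }
        if (!snapshotEpochs.contains(offset)) {
            throw new IllegalStateException("No in-memory snapshot for epoch " + offset
                + ". Snapshot epochs are: " + snapshotEpochs + ".");
        }
        lastCommittedOffset = offset;
        // Snapshots strictly older than the committed offset are no longer needed.
        snapshotEpochs.headSet(offset).clear();
    }

    // Leadership was lost; in the real system the unload happens later, asynchronously.
    void resign() {
        active = false;
    }

    // Hypothetical guard: ignore HWM updates once the coordinator is no longer active.
    // Returns true if the update was applied, false if it was skipped.
    boolean onHighWatermarkUpdate(long offset) {
        if (!active) {
            return false;
        }
        updateLastCommittedOffset(offset);
        return true;
    }

    long lastCommittedOffset() {
        return lastCommittedOffset;
    }
}
```

In this model, an HWM of 25 after writes ending at 10 and 20 trips the first check, and an HWM of 15 trips the second, because replication from the new leader advances the HWM to offsets the old coordinator never wrote or snapshotted.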
--
This message was sent by Atlassian Jira
(v8.20.10#820010)