[ https://issues.apache.org/jira/browse/KAFKA-12525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sagar Rao resolved KAFKA-12525.
-------------------------------
    Resolution: Fixed

> Inaccurate task status due to status record interleaving in fast rebalances in Connect
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-12525
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12525
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.3.1, 2.4.1, 2.5.1, 2.7.0, 2.6.1
>            Reporter: Konstantine Karantasis
>            Assignee: Sagar Rao
>            Priority: Major
>
> When a task is stopped in Connect it produces an {{UNASSIGNED}} status record.
> Equivalently, when a task is started or restarted in Connect it produces a {{RUNNING}} status record in the Connect status topic.
>
> At the same time, rebalances are decoupled from task start and stop. These operations happen in a separate executor, outside of the main worker thread that performs the rebalance.
>
> Normally, any delayed and stale {{UNASSIGNED}} status records are fenced by the worker that is sending them. This worker uses the {{StatusBackingStore#putSafe}} method, which rejects any stale status messages (it is called only for {{UNASSIGNED}} or {{FAILED}}) as long as the worker is aware of the newer status record that declares a task as {{RUNNING}}.
>
> In cases of fast consecutive rebalances where a task is revoked from one worker and assigned to another one, a small time window, and thus a race condition, has been observed: a {{RUNNING}} status record in the new generation is produced and is immediately followed by a delayed {{UNASSIGNED}} status record belonging to the same or a previous generation, before the worker that sends this message reads the {{RUNNING}} status record that corresponds to the latest generation.
>
> A couple of options are available to remediate this race condition.
> For example, a worker that has started a task can re-write the {{RUNNING}} status message in the topic if it reads a stale {{UNASSIGNED}} message from a previous generation (that should have been fenced).
> Another option is to ignore stale {{UNASSIGNED}} messages (messages from an earlier generation than the one in which the task had {{RUNNING}} status).
>
> It is worth noting that when this race condition takes place, besides the inaccurate status representation, the actual execution of the tasks remains unaffected (e.g. the tasks are running correctly even though they appear as {{UNASSIGNED}}).
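
The second option in the description (ignoring stale {{UNASSIGNED}} messages) can be illustrated with a minimal sketch. Everything here is hypothetical: the class GenerationFencedStatusStore, its nested TaskStatus type, and the apply method are invented for illustration and are not Kafka Connect classes, nor necessarily how the issue was fixed. The sketch only covers records from a strictly earlier generation; the same-generation case mentioned in the description is not handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: none of these types exist in Kafka Connect.
public class GenerationFencedStatusStore {

    enum State { RUNNING, UNASSIGNED, FAILED }

    // One status record as it might be read back from the status topic.
    static final class TaskStatus {
        final String taskId;
        final State state;
        final String workerId;
        final int generation;

        TaskStatus(String taskId, State state, String workerId, int generation) {
            this.taskId = taskId;
            this.state = state;
            this.workerId = workerId;
            this.generation = generation;
        }
    }

    // Latest accepted status per task.
    private final Map<String, TaskStatus> latest = new HashMap<>();

    // Applies a record in topic order; returns false if it was ignored as stale.
    boolean apply(TaskStatus incoming) {
        TaskStatus current = latest.get(incoming.taskId);
        boolean staleUnassigned = current != null
                && current.state == State.RUNNING
                && incoming.state == State.UNASSIGNED
                && incoming.generation < current.generation;
        if (staleUnassigned) {
            return false; // delayed record from an earlier generation: keep RUNNING
        }
        latest.put(incoming.taskId, incoming);
        return true;
    }

    // Returns the last accepted state, or null if the task is unknown.
    State stateOf(String taskId) {
        TaskStatus status = latest.get(taskId);
        return status == null ? null : status.state;
    }

    public static void main(String[] args) {
        GenerationFencedStatusStore store = new GenerationFencedStatusStore();
        // The new owner reports RUNNING in generation 5.
        store.apply(new TaskStatus("connector-0", State.RUNNING, "worker-b", 5));
        // The previous owner's delayed UNASSIGNED from generation 4 arrives afterwards.
        store.apply(new TaskStatus("connector-0", State.UNASSIGNED, "worker-a", 4));
        System.out.println(store.stateOf("connector-0")); // prints RUNNING
    }
}
```

An {{UNASSIGNED}} record from the same or a later generation is still accepted in this sketch, so a genuine stop after a later rebalance is not masked.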