ableegoldman opened a new pull request #11738:
URL: https://github.com/apache/kafka/pull/11738


   Note: this is only part 1 of the error handling work. It lays the 
groundwork for identifying and handling unstable tasks so that they don't 
prevent the other named topologies from making progress.
   
   To give an idea of what these changes are meant to support, the current 
plan is for later PRs to tackle improvements/optimizations such as:
   
   1. doing an iteration without the error task so that the healthy tasks can 
be committed (necessary for eos-v2, for which you can't commit/abort individual 
partitions of a transaction)
   2. implementing true backoff for tasks experiencing frequent or constant 
errors
   3. error categorization (and possibly other heuristics) to classify 
exceptions as "retriable" vs "thread fatal". This lets us reduce the impact of 
task errors by skipping the thread replacement when the blast radius is 
contained and the thread state is not corrupted. For example, we already apply 
this optimization for the specific case of a MissingSourceTopicException, 
since there may be other named topologies that aren't missing any topics and 
can be processed as usual.
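
   To make items 2 and 3 concrete, here is a minimal, hypothetical sketch of 
how per-task backoff and error categorization could fit together. It is not 
code from this PR or from Kafka Streams: `TaskErrorSketch`, `ErrorCategory`, 
`TaskBackoff`, `classify`, and `MissingSourceTopicStandIn` (a local stand-in 
for the real MissingSourceTopicException) are all names assumed for 
illustration, as is the choice of exponential backoff with a cap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these types exist in Kafka Streams.
final class TaskErrorSketch {

    // Item 3: two coarse categories of task error.
    enum ErrorCategory { RETRIABLE, THREAD_FATAL }

    // Local stand-in for an error whose blast radius is a single named topology.
    static final class MissingSourceTopicStandIn extends RuntimeException {
        MissingSourceTopicStandIn(String message) {
            super(message);
        }
    }

    // Assumed heuristic: only errors known to be contained are retriable;
    // anything unrecognized is treated as fatal to the thread.
    static ErrorCategory classify(Throwable error) {
        if (error instanceof MissingSourceTopicStandIn) {
            return ErrorCategory.RETRIABLE;
        }
        return ErrorCategory.THREAD_FATAL;
    }

    // Item 2: capped exponential backoff tracked per task id.
    static final class TaskBackoff {
        private final long initialDelayMs;
        private final long maxDelayMs;
        private final Map<String, Integer> consecutiveErrors = new HashMap<>();
        private final Map<String, Long> retryAtMs = new HashMap<>();

        TaskBackoff(long initialDelayMs, long maxDelayMs) {
            this.initialDelayMs = initialDelayMs;
            this.maxDelayMs = maxDelayMs;
        }

        // Doubles the delay on every consecutive error, up to maxDelayMs.
        void recordError(String taskId, long nowMs) {
            int failures = consecutiveErrors.merge(taskId, 1, Integer::sum);
            int shift = Math.min(failures - 1, 20); // bound the shift to avoid overflow
            long delayMs = Math.min(maxDelayMs, initialDelayMs << shift);
            retryAtMs.put(taskId, nowMs + delayMs);
        }

        // A successful iteration clears the task's error history.
        void recordSuccess(String taskId) {
            consecutiveErrors.remove(taskId);
            retryAtMs.remove(taskId);
        }

        // True if the task has no pending backoff at the given time.
        boolean canProcess(String taskId, long nowMs) {
            return nowMs >= retryAtMs.getOrDefault(taskId, 0L);
        }
    }
}
```

   In a sketch like this, the processing loop would skip any task for which 
`canProcess` returns false, so healthy tasks keep making progress, and would 
only replace the thread when `classify` returns `THREAD_FATAL`.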
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


