[ https://issues.apache.org/jira/browse/KAFKA-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989195#comment-16989195 ]
John Roesler commented on KAFKA-9274:
-------------------------------------

Thanks for starting this, [~bchen225242]! One thing we should document is that the current approach is actually intentional. Streams essentially delegates the handling of transient failures to the clients (Consumer, Producer, and Admin). Accordingly, the current advice is to set the client timeouts high enough to tolerate transient outages.

However, I agree with you that this is not the best approach. Setting the timeouts high enough to be resilient to most network outages has the perverse effect of allowing StreamThreads to get hung up on an individual operation for that amount of time, harming the system's overall ability to make progress on all inputs. For example, what if only one broker is unhealthy or unavailable? Rather than sit around indefinitely waiting to commit on the partition that broker leads, we could attempt to make progress on other tasks, and then try to commit the stuck one again later.

Another issue is self-healing. Suppose the broker cluster becomes unavailable. If we're running an application in which some or all of the threads die from timeouts, then whenever the brokers' operators _do_ bring the cluster back up, we would have to actually restart the application to recover. Maybe that doesn't seem so bad, but if you're in a large organization operating 100 Streams apps, it sounds like a pretty big pain to me. Conversely, if Streams were to just log a warning each time it got a timeout, but continue trying to make progress, then we would know that something is wrong (because we're monitoring the app for warnings), so we could alert the brokers' operators. However, once the issue is resolved, our app would automatically pick right back up where it left off.

At a very high level, I'd propose the following failure handling protocol (rough illustrative sketches follow after the quoted issue description below):
1. *retriable error*: A transient failure (like a timeout exception). We just put the task to the side and try to do work for the next task.
2. *recoverable error*: A permanent failure (like getting fenced) that we can recover from by re-joining the cluster. We try to close and re-open the producer, or initiate a rebalance to re-join the cluster, depending on the exact nature of the failure.
3. *fatal error*: A permanent failure that requires human intervention to resolve, such as discovering that we're in the middle of upgrading to an incompatible topology, or that the application itself is invalid for some other reason. Attempting to continue could result in data corruption, so we should just shut down the whole application so that someone can figure out what's wrong and fix it.

> Gracefully handle timeout exceptions on Kafka Streams
> -----------------------------------------------------
>
>                 Key: KAFKA-9274
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9274
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Boyang Chen
>            Priority: Major
>
> Right now, Streams does not treat timeout exceptions as retriable in general; instead, they are thrown to the application level. If not handled by the user, this unfortunately kills the stream thread.
> In fact, timeouts happen mostly due to network issues or server-side unavailability, so a hard failure on the client seems like overkill.
> We would like to discuss the best practice for handling timeout exceptions in Streams. The current state is still brainstorming and consolidating within this ticket all the cases that involve timeout exceptions.
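For reference, a minimal sketch of the current workaround mentioned above (raising the client-level timeouts through the Streams config). The application id, bootstrap servers, and the specific timeout values are illustrative placeholders, not recommendations:

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class TimeoutWorkaroundSketch {
    // Current workaround: raise client timeouts so a transient broker outage
    // does not surface as a thread-killing exception.
    public static Properties config() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        // producer.max.block.ms: how long blocking producer calls may wait
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 120000);
        // consumer.request.timeout.ms: how long individual consumer requests may take
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG), 60000);
        return props;
    }
}
{code}

And a rough sketch of how the three-way classification proposed above might look. The FailureAction enum and classify() method are hypothetical names invented here for illustration and are not part of the Streams API; only the exception types referenced are real:

{code:java}
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.errors.RetriableException;
import org.apache.kafka.common.errors.TimeoutException;
import org.apache.kafka.streams.errors.TaskMigratedException;

public class FailureClassificationSketch {

    // Hypothetical outcome of classifying an exception caught while processing a task.
    enum FailureAction {
        RETRIABLE,    // put the task aside, work on the next task, retry later
        RECOVERABLE,  // close/re-open the producer or rebalance to re-join the cluster
        FATAL         // shut down the whole application for human intervention
    }

    static FailureAction classify(final Exception e) {
        // TimeoutException is itself a RetriableException; listed explicitly for clarity.
        if (e instanceof TimeoutException || e instanceof RetriableException) {
            // transient: the broker may simply be slow or briefly unreachable
            return FailureAction.RETRIABLE;
        } else if (e instanceof ProducerFencedException || e instanceof TaskMigratedException) {
            // permanent for this producer/task, but the instance can re-join the group
            return FailureAction.RECOVERABLE;
        } else {
            // anything else (e.g. an incompatible topology) needs a human to look at it
            return FailureAction.FATAL;
        }
    }

    public static void main(final String[] args) {
        System.out.println(classify(new TimeoutException("commit timed out")));            // RETRIABLE
        System.out.println(classify(new ProducerFencedException("fenced by new member")));  // RECOVERABLE
        System.out.println(classify(new IllegalStateException("invalid topology")));        // FATAL
    }
}
{code}

The point of the sketch is that the first two categories never propagate out of the StreamThread: retriable errors only defer the task, and recoverable ones trigger a re-join, so the application keeps running and self-heals once the brokers come back.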
-- This message was sent by Atlassian Jira (v8.3.4#803005)