[ 
https://issues.apache.org/jira/browse/KAFKA-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989195#comment-16989195
 ] 

John Roesler commented on KAFKA-9274:
-------------------------------------

Thanks for starting this, [~bchen225242]!

One thing we should document is that the current approach is actually 
intentional. Streams essentially delegates the handling of transient failures 
to the clients (Consumer, Producer, and Admin). Accordingly, the current 
advice is to set the client timeouts high enough to tolerate transient outages.
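For reference, "set the client timeouts high" just means overriding the embedded 
clients' configs through StreamsConfig, roughly like the sketch below (class name 
and values are my own placeholders, not recommendations):

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class HighTimeoutConfigExample {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder

        // Raise the embedded clients' timeouts so that a transient broker outage
        // does not surface as a TimeoutException inside the StreamThread.
        // Values are arbitrary illustrations, not recommendations.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG), 60_000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.DEFAULT_API_TIMEOUT_MS_CONFIG), 120_000);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 120_000);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG), 180_000);
        return props;
    }
}
{code}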

However, I agree with you that this is not the best approach. Setting the 
timeout high enough to be resilient to most network outages has the perverse 
effect of allowing StreamThreads to get hung up on an individual operation for 
that amount of time, harming the overall ability of the system to make progress 
on all inputs. For example, what if only one broker is unhealthy/unavailable? 
Rather than sit around indefinitely waiting to commit on the partition that 
broker is leader for, we could attempt to make progress on other tasks, and 
then try to commit the stuck one again later.
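To make that concrete, here's a very rough sketch of what "put the task to the 
side" could look like. The Task interface here is hypothetical and this is not 
the real StreamThread loop, just an illustration of the idea:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.kafka.common.errors.TimeoutException;

public class DeferOnTimeoutSketch {

    // Hypothetical task abstraction, only to illustrate deferring a stuck
    // commit instead of letting the thread die; not the real Streams internals.
    interface Task {
        void process();
        void commit();  // may throw TimeoutException
    }

    void runOnce(Deque<Task> tasks) {
        Deque<Task> stuck = new ArrayDeque<>();
        while (!tasks.isEmpty()) {
            Task task = tasks.poll();
            task.process();
            try {
                task.commit();
            } catch (TimeoutException e) {
                // Log a warning and move on to the next task; the commit for
                // this one is retried on a later pass instead of blocking the
                // whole thread on a single unhealthy broker.
                stuck.add(task);
            }
        }
        tasks.addAll(stuck);  // try the stuck tasks again next iteration
    }
}
{code}

The point being that one unhealthy broker only delays the commits for its own 
partitions instead of stalling everything.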

Another issue is self-healing. Suppose that the broker cluster becomes 
unavailable. If we're running an application in which some or all of the 
threads will die from timeouts, then whenever the broker's operators _do_ bring 
it back up, we would have to actually restart the application to recover. Maybe 
this doesn't seem so bad, but if you're in a large organization operating 100 
Streams apps, it sounds like a pretty big pain to me.
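For context, the only hook we give operators today is the uncaught exception 
handler: you can observe that a StreamThread died, but the only remedy is a 
restart. A minimal sketch, with placeholder topic names and configs:

{code:java}
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class DeadThreadAlertingExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");  // placeholder topology

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // With the current behavior, a long enough outage kills the StreamThread;
        // this handler lets us alert on it, but "recovery" still means restarting
        // the whole application once the brokers are healthy again.
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("StreamThread " + thread.getName() + " died: " + throwable));

        streams.start();
    }
}
{code}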

Conversely, if Streams were to just log a warning each time it got a timeout, 
but continue trying to make progress, then we would know that there is 
something wrong (because we're monitoring the app for warnings), so we could 
alert the broker's operators. However, once the issue is resolved, our app will 
just automatically pick right back up where it left off.

At a very high level, I'd propose the following failure handling protocol (a 
rough code sketch follows the list):
1. *retriable error*: A transient failure (like a timeout exception). We just 
put the task to the side, and try to do work for the next task.
2. *recoverable error*: A permanent failure (like getting fenced) that we can 
recover from by re-joining the cluster. We try to close and re-open the 
producer, or initiate a rebalance to re-join the cluster, depending on the 
exact nature of the failure.
3. *fatal error*: A permanent failure that requires human intervention to 
resolve. Something like discovering that we're in the middle of upgrading to an 
incompatible topology, or that the application itself is invalid for some other 
reason. Attempting to continue could result in data corruption, so we should 
just shut down the whole application so that someone can figure out what's 
wrong and fix it.
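
In code, the classification could look something like the sketch below; the 
mapping of concrete exception types to buckets is just my assumption for 
illustration, not a worked-out design:

{code:java}
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.errors.RetriableException;

public class FailureClassificationSketch {

    enum Action { RETRY_LATER, RECOVER, SHUT_DOWN }

    static Action classify(Throwable t) {
        if (t instanceof RetriableException) {            // e.g. TimeoutException
            return Action.RETRY_LATER;                    // 1. park the task, work on others
        } else if (t instanceof ProducerFencedException) {
            return Action.RECOVER;                        // 2. re-create producer / re-join the group
        } else {
            return Action.SHUT_DOWN;                      // 3. fatal: needs human intervention
        }
    }
}
{code}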

> Gracefully handle timeout exceptions on Kafka Streams
> -----------------------------------------------------
>
>                 Key: KAFKA-9274
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9274
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Boyang Chen
>            Priority: Major
>
> Right now Streams does not treat timeout exceptions as retriable in general; 
> they are thrown up to the application level. If not handled by the user, this 
> unfortunately kills the stream thread.
> In fact, timeouts mostly happen due to network issues or server-side 
> unavailability, so a hard failure on the client seems like overkill.
> We would like to discuss the best practice for handling timeout exceptions in 
> Streams. The current state is still brainstorming; this ticket consolidates 
> all the cases that involve timeout exceptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
