[jira] [Commented] (KAFKA-10559) Don't shutdown the entire app upon TimeoutException during internal topic validation

Sophie Blee-Goldman (Jira) Mon, 05 Oct 2020 11:54:18 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208257#comment-17208257
 ]


Sophie Blee-Goldman commented on KAFKA-10559:
---------------------------------------------

[~sagarrao] Yeah, go ahead! This should be a pretty small PR so it would be 
great if we could knock it out in the next week or two. Just ping me when it's 
ready.

For the PR itself, I think it sounds reasonable to just rethrow the 
TimeoutException to kill the thread. The "add/recover stream thread" 
functionality will probably miss the 2.7 release, but it'll be implemented 
soon. So we don't really need to go out of our way to save a single thread 
from death in rare circumstances, imo.

> Don't shutdown the entire app upon TimeoutException during internal topic 
> validation
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10559
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10559
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Sophie Blee-Goldman
>            Priority: Blocker
>             Fix For: 2.7.0
>
>
> During some of the KIP-572 work, we made things pretty brittle by changing 
> the StreamsPartitionAssignor to send the `INCOMPLETE_SOURCE_TOPIC_METADATA` 
> error code and shut down the entire application if a TimeoutException is hit 
> during the internal topic creation/validation.
> Internal topic validation occurs during every rebalance, and we have seen it 
> time out on topic discovery in unstable environments. So shutting down the 
> entire application seems like a step in the wrong direction, and antithetical 
> to the goal of KIP-572 (improving the resiliency of Streams in the face of 
> TimeoutExceptions).
> I'm not totally sure what the previous behavior was, but it seems to me we 
> have three options:
>  # Rethrow the TimeoutException and allow it to kill the thread
>  # Swallow the TimeoutException and retry the rebalance indefinitely
>  # Some combination of the above: swallow the TimeoutException but don't 
> retry indefinitely:
>  ## Start a timer and allow retrying rebalances for up to the configured 
> task.timeout.ms, the timeout config introduced in KIP-572
>  ## Retry for some constant number of rebalances
> I think if we go with option 3, then shutting down the entire application is 
> relatively more palatable, as we have given the environment a chance to 
> stabilize.
> But killing the thread still seems preferable, given the two new features 
> that are coming out soon: the ability to start up new threads, and the 
> improved exception handler that allows the user to choose to shut down the 
> entire application if that's really what they want. Once users have this 
> level of control over the application, we should allow them to decide how 
> they want to handle exceptional cases like this, rather than forcing an 
> option on them (e.g. shut down everything).
>  
> Imo we should fix this before 2.7 comes out, even if it's just a partial fix 
> (e.g. we do option 1 in 2.7, but plan to implement option 3 eventually).
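
For illustration only, the time-bounded variant of option 3 can be sketched as a generic retry helper: keep retrying while the action times out, and once the budget is spent fall back to option 1 by rethrowing. This is not the StreamsPartitionAssignor code, and Streams would retry across rebalances rather than in a loop. The class and method names are hypothetical, java.util.concurrent.TimeoutException stands in for Kafka's own TimeoutException so the example is self-contained, and the timeout argument plays the role of task.timeout.ms.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Kafka Streams.
class BoundedRetrySketch {

    /**
     * Runs the action, retrying while it throws TimeoutException.
     * Once the timeout budget is exhausted, the last TimeoutException is
     * rethrown so the caller (e.g. a single thread) fails, rather than
     * the whole application being shut down.
     */
    static <T> T retryUntilDeadline(Callable<T> action,
                                    Duration timeout,
                                    Duration backoff) throws Exception {
        // Assumption: the budget starts when the first attempt starts.
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                return action.call();
            } catch (TimeoutException e) {
                if (System.nanoTime() - deadlineNanos >= 0) {
                    throw e; // budget spent: behave like option 1
                }
                Thread.sleep(backoff.toMillis()); // brief pause before retrying
            }
        }
    }
}
```

Swapping the deadline check for an attempt counter would give the other sub-option (retry for a constant number of rebalances).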



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
