[
https://issues.apache.org/jira/browse/KAFKA-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208257#comment-17208257
]
Sophie Blee-Goldman commented on KAFKA-10559:
---------------------------------------------
[~sagarrao] Yeah, go ahead! This should be a pretty small PR so it would be
great if we could knock it out in the next week or two. Just ping me when it's
ready.
For the PR itself, I think it sounds reasonable to just rethrow the
TimeoutException to kill the thread. The "add/recover stream thread"
functionality will probably slip 2.7, but it'll be implemented soon. So we
don't really need to go out of our way to save a single thread from death in
rare circumstances, imo.
> Don't shutdown the entire app upon TimeoutException during internal topic
> validation
> ------------------------------------------------------------------------------------
>
> Key: KAFKA-10559
> URL: https://issues.apache.org/jira/browse/KAFKA-10559
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Sophie Blee-Goldman
> Priority: Blocker
> Fix For: 2.7.0
>
>
> During some of the KIP-572 work, we made things pretty brittle by changing
> the StreamsPartitionAssignor to send the `INCOMPLETE_SOURCE_TOPIC_METADATA`
> error code and shut down the entire application if a TimeoutException is hit
> during the internal topic creation/validation.
> Internal topic validation occurs during every rebalance, and we have seen it
> time out on topic discovery in unstable environments. So shutting down the
> entire application seems like a step in the wrong direction, and antithetical
> to the goal of KIP-572 (improving the resiliency of Streams in the face of
> TimeoutExceptions).
> I'm not totally sure what the previous behavior was, but it seems to me we
> have three options:
> # Rethrow the TimeoutException and allow it to kill the thread
> # Swallow the TimeoutException and retry the rebalance indefinitely
> # Some combination of the above: swallow the TimeoutException but don't
> retry indefinitely:
> ## Start a timer and allow retrying rebalances for up to the configured
> task.timeout.ms, the timeout config introduced in KIP-572
> ## Retry for some constant number of rebalances
> I think if we go with option 3, then shutting down the entire application is
> relatively more palatable, as we have given the environment a chance to
> stabilize.
> But, killing the thread still seems preferable, given the two new features
> that are coming out soon: the ability to start up new threads, and the
> improved exception handler that allows the user to choose to shut down the
> entire application if that's really what they want. Once users have this
> level of control over the application, we should allow them to decide how
> they want to handle exceptional cases like this, rather than forcing an
> option on them (e.g. shutting down everything).
>
> Imo we should fix this before 2.7 comes out, even if it's just a partial fix
> (eg we do option 1 in 2.7, but plan to implement option 3 eventually)
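
For illustration only, here is a rough sketch of how option 3 with a timer might look, falling back to option 1 (rethrow) once the budget is used up. It is not the actual StreamsPartitionAssignor code. The class BoundedRebalanceRetry, its attempt() method, and the injected clock are all hypothetical names, and the nested TimeoutException is a stand-in for org.apache.kafka.common.errors.TimeoutException so that the example compiles without Kafka on the classpath. The budget is assumed to be read from task.timeout.ms.

```java
import java.util.function.LongSupplier;

public class BoundedRebalanceRetry {

    /** Stand-in for Kafka's unchecked TimeoutException (assumption, keeps the sketch dependency-free). */
    public static class TimeoutException extends RuntimeException {
        public TimeoutException(String message) {
            super(message);
        }
    }

    private final long taskTimeoutMs;   // assumed to come from task.timeout.ms
    private final LongSupplier clock;   // injectable so the timer can be tested
    private long firstTimeoutMs = -1L;  // -1 means no timeout is currently outstanding

    public BoundedRebalanceRetry(long taskTimeoutMs, LongSupplier clock) {
        this.taskTimeoutMs = taskTimeoutMs;
        this.clock = clock;
    }

    /**
     * Runs one validation attempt (hypothetically, once per rebalance).
     * Returns true on success, false if the attempt timed out but the retry
     * budget is not yet exhausted. Rethrows the TimeoutException once the
     * budget is exhausted, which would kill only the calling thread.
     */
    public boolean attempt(Runnable validateInternalTopics) {
        try {
            validateInternalTopics.run();
            firstTimeoutMs = -1L; // success resets the timer
            return true;
        } catch (TimeoutException e) {
            long now = clock.getAsLong();
            if (firstTimeoutMs < 0) {
                firstTimeoutMs = now; // start the timer on the first timeout
            }
            if (now - firstTimeoutMs >= taskTimeoutMs) {
                throw e; // budget exhausted: behave like option 1
            }
            return false; // swallow and let the next rebalance retry
        }
    }
}
```

A caller would keep one instance per thread, pass System::currentTimeMillis as the clock, and trigger another rebalance whenever attempt() returns false. Resetting the timer on success means only consecutive timeouts count against the budget.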