[ https://issues.apache.org/jira/browse/KAFKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174222#comment-14174222 ]
Ewen Cheslack-Postava commented on KAFKA-1108:
----------------------------------------------

Updated reviewboard https://reviews.apache.org/r/26770/diff/
 against branch origin/trunk

> when controlled shutdown attempt fails, the reason is not always logged
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-1108
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1108
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Rosenberg
>            Assignee: Ewen Cheslack-Postava
>              Labels: newbie
>             Fix For: 0.9.0
>
>         Attachments: KAFKA-1108.patch, KAFKA-1108_2014-10-16_13:53:11.patch
>
>
> In KafkaServer.controlledShutdown(), it initiates a controlled shutdown, and
> then if there's a failure, it will retry the controlledShutdown.
> Looking at the code, there are 2 ways a retry could fail, one with an error
> response from the controller, and this messaging code:
> {code}
> info("Remaining partitions to move: %s".format(shutdownResponse.partitionsRemaining.mkString(",")))
> info("Error code from controller: %d".format(shutdownResponse.errorCode))
> {code}
> Alternatively, there could be an IOException, with this code executed:
> {code}
> catch {
>   case ioe: java.io.IOException =>
>     channel.disconnect()
>     channel = null
>     // ignore and try again
> }
> {code}
> And then finally, in either case:
> {code}
> if (!shutdownSuceeded) {
>   Thread.sleep(config.controlledShutdownRetryBackoffMs)
>   warn("Retrying controlled shutdown after the previous attempt failed...")
> }
> {code}
> It would be nice if the nature of the IOException were logged in either case
> (I'd be happy with an ioe.getMessage() instead of a full stack trace, as
> kafka in general tends to be too willing to dump IOException stack traces!).
> I suspect, in my case, the actual IOException is a socket timeout (as the
> time between initial "Starting controlled shutdown...." and the first
> "Retrying..." message is usually about 35 seconds (the socket timeout + the
> controlled shutdown retry backoff)). So, it would seem that really, the issue
> in this case is that controlled shutdown is taking too long. It would seem
> sensible instead to have the controller report back to the server (before the
> socket timeout) that more time is needed, etc.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
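
For illustration only, here is a minimal sketch of the logging the report asks for: a retry loop that records ioe.getMessage() (not a full stack trace) before backing off and retrying. It is not the code from the attached patch. The names ControlledShutdownRetrySketch, retryWithLogging, maxRetries, backoffMs, log and attempt are hypothetical stand-ins for the logic in KafkaServer.controlledShutdown().

```scala
import java.io.IOException

// Hypothetical, self-contained sketch; not Kafka's actual implementation.
object ControlledShutdownRetrySketch {

  /** Runs `attempt` up to `maxRetries` times and returns true once it succeeds.
    * `attempt` returns false when the controller answers with an error and
    * throws IOException on a network failure such as a socket timeout.
    */
  def retryWithLogging(maxRetries: Int, backoffMs: Long, log: String => Unit)
                      (attempt: () => Boolean): Boolean = {
    var succeeded = false
    var remaining = maxRetries
    while (!succeeded && remaining > 0) {
      remaining -= 1
      try {
        succeeded = attempt()
        if (!succeeded)
          log("Controlled shutdown attempt failed: controller returned an error")
      } catch {
        case ioe: IOException =>
          // Log only the message rather than the stack trace, then try again.
          log("Controlled shutdown attempt failed with IOException: " + ioe.getMessage)
      }
      if (!succeeded && remaining > 0) {
        Thread.sleep(backoffMs)
        log("Retrying controlled shutdown after the previous attempt failed...")
      }
    }
    succeeded
  }
}
```

Passing the logger in as a plain function keeps the sketch independent of Kafka's Logging trait; in the real class the info/warn helpers shown in the quoted snippets would presumably take its place.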