On Fri, Oct 25, 2013 at 3:22 PM, Jason Rosenberg <j...@squareup.com> wrote:

> It looks like when the controlled shutdown fails with an IOException, the
> exception is swallowed, and we see nothing in the logs:
>
>     catch {
>       case ioe: java.io.IOException =>
>         channel.disconnect()
>         channel = null
>         // ignore and try again
>     }
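This is not the actual broker code (which is Scala, in kafka.server.KafkaServer); it's a minimal Java sketch of the pattern being suggested — record why the attempt failed before retrying, rather than swallowing the IOException silently. All names here are hypothetical:

```java
import java.io.IOException;

// Hypothetical stand-in for the controlled-shutdown retry path.
// Only illustrates logging the failure cause instead of ignoring it.
public class ControlledShutdownSketch {
    static String lastError = null; // stand-in for a real log4j logger

    static boolean attemptShutdown(boolean simulateIoFailure) {
        try {
            if (simulateIoFailure) {
                throw new IOException("connection reset by peer");
            }
            return true; // controlled shutdown succeeded
        } catch (IOException ioe) {
            // Surface the cause before the retry, so the logs are not silent
            lastError = "Controlled shutdown attempt failed: " + ioe.getMessage();
            return false;
        }
    }

    public static void main(String[] args) {
        if (!attemptShutdown(true)) {
            System.out.println(lastError);
        }
    }
}
```

With something like this in place, a retry after an IOException would at least leave one WARN line explaining the previous failure.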
> Question.....what is the ramification of an 'unclean shutdown'? Is it no
> different than a shutdown with no controlled shutdown ever attempted? Or
> is it something more difficult to recover from?

Unclean shutdown could result in data loss, since you are moving leadership
to a replica that has fallen out of ISR - i.e., its log end offset is behind
the last committed message to this partition.

> I am still not clear on how to generate the state transition logs.....Does
> the StateChangeLogMergerTool run against the main logs for the server (and
> just collates entries there)?

Take a look at the packaged log4j.properties file. The controller's
partition/replica state machines and its channel manager (which sends/receives
LeaderAndIsr requests/responses to brokers) use a stateChangeLogger. The
replica managers on all brokers also use this logger.

> This one eventually succeeds, after a mysterious failure:
>
> 2013-10-25 00:11:53,891 INFO [Thread-13] server.KafkaServer - [Kafka
> Server 10], Starting controlled shutdown
> ....<no exceptions between these log lines><no "Remaining partitions to
> move....">....
> 2013-10-25 00:12:28,965 WARN [Thread-13] server.KafkaServer - [Kafka
> Server 10], Retrying controlled shutdown after the previous attempt
> failed...

Our logging can improve - e.g., it looks like on controller movement we could
retry without saying why.

Thanks,

Joel

> On Fri, Oct 25, 2013 at 12:51 PM, Jason Rosenberg <j...@squareup.com> wrote:
>
>> Neha,
>>
>> It looks like the StateChangeLogMergerTool takes state change logs as
>> input. I'm not sure I know where those live? (Maybe update the doc on
>> that wiki page to describe!)
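The state change logs live wherever log4j routes the stateChangeLogger; in the packaged log4j.properties it goes to a dedicated state-change.log file. A sketch of the relevant section, along the lines of the 0.8-era distribution (appender names and paths may differ in your install):

```
log4j.appender.stateChangeAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.stateChangeAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.stateChangeAppender.File=${kafka.logs.dir}/state-change.log
log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

# The controller state machines and replica managers log to this logger
log4j.logger.state.change.logger=TRACE, stateChangeAppender
log4j.additivity.state.change.logger=false
```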
>>
>> Thanks,
>>
>> Jason
>>
>>
>> On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>>
>>> Jason,
>>>
>>> The state change log tool is described here -
>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool
>>>
>>> I'm curious what the IOException is and whether we can improve error
>>> reporting. Can you send around the stack trace?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>> On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <j...@squareup.com> wrote:
>>>
>>> > Ok,
>>> >
>>> > Looking at the controlled shutdown code, it appears that it can fail
>>> > with an IOException too, in which case it won't report the remaining
>>> > partitions to replicate, etc. (I think that might be what I'm seeing,
>>> > since I never saw the log line for "controlled shutdown failed, X
>>> > remaining partitions", etc.). In my case, that may be the issue (it's
>>> > happening during a rolling restart, and the second of 3 nodes might be
>>> > trying to shut down before the first one has completely come back up).
>>> >
>>> > I've heard you guys mention several times now about controller and
>>> > state change logs. But I don't know where those live (or how to
>>> > configure them). Please advise!
>>> >
>>> > Thanks,
>>> >
>>> > Jason
>>> >
>>> >
>>> > On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>>> >
>>> > > Controlled shutdown can fail if the cluster has a non-zero
>>> > > under-replicated partition count, since that means the leaders may
>>> > > not move off of the broker being shut down, causing controlled
>>> > > shutdown to fail. The backoff might help if the under-replication is
>>> > > just temporary, due to a spike in traffic. This is the most common
>>> > > reason it might fail, besides bugs. But you can check the logs to
>>> > > see why the shutdown failed.
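For reference, the merge tool linked above is run via the kafka-run-class wrapper. A sketch of a typical invocation, per the wiki page — the class name and flags are from the 0.8-era tooling, so verify them against your version, and adjust the log path to wherever your log4j.properties writes state-change.log:

```
bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger \
  --logs /var/kafka/logs/state-change.log \
  --topic mytopic \
  --partitions 0,1,2
```

It collates entries from the state change logs of multiple brokers into a single time-ordered view, which is what makes it useful for reconstructing a leadership transition.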
>>> > >
>>> > > Thanks,
>>> > > Neha
>>> > >
>>> > > On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <j...@squareup.com> wrote:
>>> > >
>>> > > > I'm running into an issue where sometimes the controlled shutdown
>>> > > > fails to complete after the default 3 retry attempts. In one case,
>>> > > > this ended with a broker undergoing an unclean shutdown, and it
>>> > > > was in a rather bad state after restart. Producers would connect
>>> > > > to the metadata vip, still think that this broker was the leader,
>>> > > > fail on that leader, then reconnect to the metadata vip, and get
>>> > > > sent back to that same failed broker! Does that make sense?
>>> > > >
>>> > > > I'm trying to understand the conditions which cause the controlled
>>> > > > shutdown to fail. There doesn't seem to be a setting for the max
>>> > > > amount of time to wait, etc.
>>> > > >
>>> > > > It would be nice to specify how long to try before giving up
>>> > > > (hopefully giving up in a graceful way).
>>> > > >
>>> > > > Instead, we have a retry count, but it's not clear what this retry
>>> > > > count is really specifying, in terms of how long to keep trying,
>>> > > > etc.
>>> > > >
>>> > > > Also, what are the ramifications of different settings for
>>> > > > controlled.shutdown.retry.backoff.ms? Is there a reason we want
>>> > > > to wait before retrying again (again, it would be helpful to
>>> > > > understand the reasons for a controlled shutdown failure)?
>>> > > >
>>> > > > Thanks,
>>> > > >
>>> > > > Jason
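For reference, the knobs discussed in this thread are broker-side settings in server.properties. A sketch of the relevant fragment — the defaults shown are from the 0.8-era docs, so verify them against your version:

```
# Attempt to move leadership off the broker before shutting down
controlled.shutdown.enable=true
# Number of attempts before falling back to unclean shutdown (default 3)
controlled.shutdown.max.retries=3
# Pause between attempts, giving a temporary ISR lag time to catch up
controlled.shutdown.retry.backoff.ms=5000
```

So the retry count bounds total effort only indirectly: the worst-case wait before an unclean shutdown is roughly max.retries times the backoff, plus however long each attempt itself takes.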