Re: Controlled shutdown failure, retry settings

Jason Rosenberg Fri, 25 Oct 2013 09:52:46 -0700

Neha,

It looks like the StateChangeLogMergerTool takes state change logs as
input.  I'm not sure where those live.  (Maybe update the doc on that
wiki page to describe where to find them!)


Thanks,

Jason


On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Jason,
>
> The state change log tool is described here -
>
> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool
>
> I'm curious what the IOException is and if we can improve error reporting.
> Can you send around the stack trace?
>
> Thanks,
> Neha
>
>
> On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <j...@squareup.com> wrote:
>
> > Ok,
> >
> > Looking at the controlled shutdown code, it appears that it can fail with
> > an IOException too, in which case it won't report the remaining partitions
> > to replicate, etc.  (I think that might be what I'm seeing, since I never
> > saw the log line for "controlled shutdown failed, X remaining partitions",
> > etc.).  In my case, that may be the issue (it's happening during a rolling
> > restart, and the second of 3 nodes might be trying to shut down before the
> > first one has completely come back up).
> >
> > I've heard you guys mention the controller and state change logs several
> > times now.  But I don't know where those live (or how to configure them).
> > Please advise!
> >
> > Thanks,
> >
> > Jason
> >
> >
> > On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> >
> > > Controlled shutdown can fail if the cluster has a non-zero
> > > under-replicated partition count, since that means the leaders may not
> > > move off of the broker being shut down. The backoff might help if the
> > > under-replication is just temporary, due to a spike in traffic. This is
> > > the most common reason it might fail, besides bugs. But you can check the
> > > logs to see why the shutdown failed.
> > >
> > > Thanks,
> > > Neha
> > > On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <j...@squareup.com> wrote:
> > >
> > > > I'm running into an issue where sometimes the controlled shutdown fails
> > > > to complete after the default 3 retry attempts.  In one case, this ended
> > > > with a broker undergoing an unclean shutdown, and then it was in a
> > > > rather bad state after restart.  Producers would connect to the metadata
> > > > vip, still think that this broker was the leader, fail on that leader,
> > > > reconnect to the metadata vip, and get sent back to that same failed
> > > > broker!   Does that make sense?
> > > >
> > > > I'm trying to understand the conditions which cause the controlled
> > > > shutdown to fail.  There doesn't seem to be a setting for max amount of
> > > > time to wait, etc.
> > > >
> > > > It would be nice to specify how long to try before giving up (hopefully
> > > > giving up in a graceful way).
> > > >
> > > > Instead, we have a retry count, but it's not clear what this retry count
> > > > is really specifying, in terms of how long to keep trying, etc.
> > > >
> > > > Also, what are the ramifications of different settings for
> > > > controlled.shutdown.retry.backoff.ms?  Is there a reason we want to wait
> > > > before retrying (again, it would be helpful to understand the reasons
> > > > for a controlled shutdown failure)?
> > > >
> > > > Thanks,
> > > >
> > > > Jason
> > > >
> > >
> >
>
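
To make the retry-count question in the thread concrete, here is a rough
sketch of how a retry count and a fixed backoff combine into a bound on how
long the broker keeps trying.  This is not the Kafka broker code: the
function names, the sleep-between-attempts behavior, and the 5000 ms backoff
value are assumptions for illustration.  Only the default of 3 retries and
the controlled.shutdown.retry.backoff.ms setting come from the messages
above.

```python
import time
from typing import Callable


def controlled_shutdown_with_retries(
    attempt: Callable[[], bool],
    max_retries: int = 3,          # default retry count mentioned in the thread
    backoff_ms: int = 5000,        # illustrative value, not taken from the thread
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical sketch: call `attempt` up to `max_retries` times.

    `attempt` returns True when leadership has moved off the broker.
    An IOError (e.g. the controller is unreachable) counts as a failed
    attempt.  The loop sleeps for the backoff between attempts, but not
    after the last one.
    """
    for attempt_number in range(1, max_retries + 1):
        try:
            if attempt():
                return True
        except IOError:
            pass  # treated the same as an unsuccessful attempt
        if attempt_number < max_retries:
            # Give temporarily under-replicated partitions time to catch up.
            sleep(backoff_ms / 1000.0)
    return False  # caller falls back to an unclean shutdown


def max_backoff_wait_ms(max_retries: int, backoff_ms: int) -> int:
    """Upper bound on time spent sleeping in the loop above (excludes the
    time each attempt itself takes)."""
    return max(max_retries - 1, 0) * backoff_ms
```

Under these assumptions, the retry count only bounds the wait indirectly:
the time spent backing off is at most (retries - 1) * backoff, plus however
long each attempt takes.  Raising either setting gives a temporarily
under-replicated cluster more time to recover before the broker gives up.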
