Re: Controlled shutdown failure, retry settings

Neha Narkhede Fri, 25 Oct 2013 09:39:48 -0700

Jason,

The state change log tool is described here -
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool


I'm curious what the IOException is and if we can improve error reporting.
Can you send around the stack trace?

Thanks,
Neha


On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <j...@squareup.com> wrote:

> Ok,
>
> Looking at the controlled shutdown code, it appears that it can fail with
> an IOException too, in which case it won't report the remaining partitions
> to replicate, etc.  (I think that might be what I'm seeing, since I never
> saw the log line for "controlled shutdown failed, X remaining partitions",
> etc.)  In my case, that may be the issue (it's happening during a rolling
> restart, and the second of 3 nodes might be trying to shut down before the
> first one has completely come back up).
>
> I've heard you guys mention the controller and state change logs several
> times now.  But I don't know where those live (or how to configure them).
> Please advise!
>
> Thanks,
>
> Jason
>
>
> On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > Controlled shutdown can fail if the cluster has a non-zero under-replicated
> > partition count, since that means the leaders may not be able to move off
> > the broker being shut down. The backoff might help if the under-replication
> > is just temporary, due to a spike in traffic. Besides bugs, this is the
> > most common reason it might fail. But you can check the logs to see why
> > the shutdown failed.
> >
> > Thanks,
> > Neha
> > On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <j...@squareup.com> wrote:
> >
> > > I'm running into an issue where the controlled shutdown sometimes fails
> > > to complete after the default 3 retry attempts.  In one case this ended
> > > with a broker undergoing an unclean shutdown, and it was in a rather bad
> > > state after restart.  Producers would connect to the metadata vip, still
> > > think that this broker was the leader, fail on that leader, reconnect to
> > > the metadata vip, and get sent back to that same failed broker!  Does
> > > that make sense?
> > >
> > > I'm trying to understand the conditions that cause the controlled
> > > shutdown to fail.  There doesn't seem to be a setting for the max amount
> > > of time to wait, etc.
> > >
> > > It would be nice to specify how long to try before giving up (hopefully
> > > giving up in a graceful way).
> > >
> > > Instead, we have a retry count, but it's not clear what this retry count
> > > is really specifying, in terms of how long to keep trying, etc.
> > >
> > > Also, what are the ramifications of different settings for
> > > controlled.shutdown.retry.backoff.ms?  Is there a reason we want to wait
> > > before retrying (again, it would be helpful to understand the reasons
> > > for a controlled shutdown failure)?
> > >
> > > Thanks,
> > >
> > > Jason
> > >
> >
>
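
To make the retry-count question above concrete, here is a minimal sketch of
how a retry count and a backoff together bound how long a shutdown keeps
trying. It is not Kafka's actual code: controlled_shutdown, attempt_once, and
worst_case_backoff_ms are hypothetical names, and the defaults (3 retries,
5000 ms backoff) are assumed to mirror controlled.shutdown.max.retries and
controlled.shutdown.retry.backoff.ms. An I/O error is treated as a failed
attempt with an unknown number of remaining partitions, which matches the
case where no "X remaining partitions" line is ever logged.

```python
import time


def controlled_shutdown(attempt_once, max_retries=3, backoff_ms=5000,
                        sleep=time.sleep):
    """Illustrative retry loop (hypothetical, not the broker's implementation).

    attempt_once() returns the number of partitions still led by this broker,
    or raises OSError if the request to the controller fails.
    Returns True if an attempt reports zero remaining partitions.
    """
    for attempt in range(1, max_retries + 1):
        try:
            remaining = attempt_once()
        except OSError:
            # I/O failure: no partition count is available to report.
            remaining = None
        if remaining == 0:
            return True
        if attempt < max_retries:
            # Give temporarily under-replicated partitions time to catch up.
            sleep(backoff_ms / 1000.0)
    return False


def worst_case_backoff_ms(max_retries, backoff_ms):
    """Total time spent sleeping if every attempt in the sketch above fails."""
    return max(max_retries - 1, 0) * backoff_ms
```

Under these assumptions, 3 retries with a 5000 ms backoff spend at most about
10 seconds waiting between attempts, plus the duration of the attempts
themselves. Raising either value lengthens that window, which only helps if
the under-replication is temporary.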
