Re: Controlled shutdown failure, retry settings

2013-11-03 Thread Jun Rao
A replica is dropped out of ISR if (1) it hasn't issued a fetch request for some time, or (2) it's behind the leader by some messages. The replica will be added back to ISR if neither condition is true any longer. The actual value depends on the application. For example, if there is a spike and the fol
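
A minimal sketch of the two conditions described above, for readers who want them spelled out. It is not Kafka's actual implementation: the class, field, and parameter names are hypothetical, and the two thresholds are assumed to play the roles of replica.lag.time.max.ms and replica.lag.max.messages mentioned elsewhere in this thread.

```java
// Hypothetical illustration of the ISR membership rule; not Kafka source code.
final class IsrCheck {

    /** Leader's view of one follower. Field names are illustrative assumptions. */
    static final class FollowerState {
        final long lastFetchTimeMs; // when the follower last issued a fetch request
        final long logEndOffset;    // follower's log end offset

        FollowerState(long lastFetchTimeMs, long logEndOffset) {
            this.lastFetchTimeMs = lastFetchTimeMs;
            this.logEndOffset = logEndOffset;
        }
    }

    /**
     * Returns true when neither drop-out condition holds:
     * (1) no fetch request for longer than maxLagTimeMs, or
     * (2) more than maxLagMessages behind the leader.
     */
    static boolean isInSync(FollowerState follower,
                            long leaderLogEndOffset,
                            long nowMs,
                            long maxLagTimeMs,
                            long maxLagMessages) {
        boolean noRecentFetch = nowMs - follower.lastFetchTimeMs > maxLagTimeMs;
        boolean tooFarBehind = leaderLogEndOffset - follower.logEndOffset > maxLagMessages;
        return !noRecentFetch && !tooFarBehind;
    }
}
```

Because the check is re-evaluated against current state, a follower that resumes fetching and closes the gap passes it again, which matches the "added back to ISR" behavior described above.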

Re: Controlled shutdown failure, retry settings

2013-11-03 Thread Jason Rosenberg
Jun, Can you explain the difference between "failed" and "slow"? In either case, the follower drops out of the ISR, and can come back later if it catches up, no? In the configuration doc, it seems to describe them both with the same language: "if ., the leader will remove the follower from

Re: Controlled shutdown failure, retry settings

2013-11-02 Thread Jun Rao
replica.lag.time.max.ms is used to detect a failed broker. replica.lag.max.messages is used to detect a slow broker. Thanks, Jun On Fri, Nov 1, 2013 at 10:36 PM, Jason Rosenberg wrote: > In response to Joel's point, I think I do understand that messages can be > lost, if in fact we have dropp

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Jason Rosenberg
In response to Joel's point, I think I do understand that messages can be lost, if in fact we have dropped down to only 1 member in the ISR at the time the message is written, and then that 1 node goes down. What I'm not clear on are the conditions under which a node can drop out of the ISR. You

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Neha Narkhede
For supporting more durability at the expense of availability, we have a JIRA that we will fix on trunk. This will allow you to configure the default as well as per-topic durability vs availability behavior - https://issues.apache.org/jira/browse/KAFKA-1028 Thanks, Neha On Fri, Nov 1, 2013 at 1

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Joel Koshy
Unclean shutdown could result in data loss - since you are moving leadership to a replica that has fallen out of ISR, i.e., its log end offset is behind the last committed message to this partition. >>> But if data is written with 'request.required.acks=-1', no data s

Re: Controlled shutdown failure, retry settings

2013-10-29 Thread Jason Rosenberg
I've filed: https://issues.apache.org/jira/browse/KAFKA-1108 On Tue, Oct 29, 2013 at 4:29 PM, Jason Rosenberg wrote: > Here's another exception I see during controlled shutdown (this time there > was not an unclean shutdown problem). Should I be concerned about this > exception? Is any data lo

Re: Controlled shutdown failure, retry settings

2013-10-29 Thread Jason Rosenberg
Here's another exception I see during controlled shutdown (this time there was not an unclean shutdown problem). Should I be concerned about this exception? Is any data loss possible with this? This one happened after the first "Retrying controlled shutdown after the previous attempt failed..." m

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
On Fri, Oct 25, 2013 at 9:16 PM, Joel Koshy wrote: > > Unclean shutdown could result in data loss - since you are moving > leadership to a replica that has fallen out of ISR, i.e., its log end > offset is behind the last committed message to this partition. > > But if data is written with 'reque

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Joel Koshy
On Fri, Oct 25, 2013 at 3:22 PM, Jason Rosenberg wrote: > It looks like when the controlled shutdown fails with an IOException, the > exception is swallowed, and we see nothing in the logs: > > catch { > case ioe: java.io.IOException => > channel.disconne

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
It looks like when the controlled shutdown fails with an IOException, the exception is swallowed, and we see nothing in the logs: catch { case ioe: java.io.IOException => channel.disconnect() channel = null // ignore and tr
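
One way the concern above could be addressed is to report the IOException before retrying instead of ignoring it. The following is a rough sketch under stated assumptions, not the broker's real code: `Attempt`, `shutdownWithRetries`, and the `logger` callback are hypothetical names, and `maxRetries` and `backoffMs` simply stand in for whatever retry settings the broker uses.

```java
import java.io.IOException;
import java.util.function.Consumer;

// Hypothetical retry loop that surfaces I/O failures; not Kafka source code.
final class ShutdownRetry {

    /** Stand-in for one controlled-shutdown request; returns true when it completed. */
    interface Attempt {
        boolean run() throws IOException;
    }

    /**
     * Tries the attempt up to maxRetries times, sleeping backoffMs between tries.
     * Every failure is passed to the logger so nothing is silently swallowed.
     */
    static boolean shutdownWithRetries(Attempt attempt,
                                       int maxRetries,
                                       long backoffMs,
                                       Consumer<String> logger) {
        for (int i = 1; i <= maxRetries; i++) {
            try {
                if (attempt.run()) {
                    return true;
                }
                logger.accept("controlled shutdown attempt " + i + " did not complete");
            } catch (IOException ioe) {
                // Report the cause, then fall through to the backoff and retry.
                logger.accept("controlled shutdown attempt " + i + " failed: " + ioe);
            }
            if (i < maxRetries) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt for the caller and stop retrying.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Passing the logger in as a callback keeps the sketch independent of any particular logging library; in a real broker this would presumably be the existing logging facility.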

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
Neha, It looks like the StateChangeLogMergerTool takes state change logs as input. I'm not sure I know where those live? (Maybe update the doc on that wiki page to describe!). Thanks, Jason On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede wrote: > Jason, > > The state change log tool is desc

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Neha Narkhede
Jason, The state change log tool is described here - https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool I'm curious what the IOException is and if we can improve error reporting. Can you send around the stack trace? Thanks, Neha On

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
Ok, Looking at the controlled shutdown code, it appears that it can fail with an IOException too, in which case it won't report the remaining partitions to replicate, etc. (I think that might be what I'm seeing, since I never saw the log line for "controlled shutdown failed, X remaining partition

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Neha Narkhede
Controlled shutdown can fail if the cluster has a non-zero under-replicated partition count, since that means the leaders may not move off of the broker being shut down, causing controlled shutdown to fail. The backoff might help if the under-replication is just temporary due to a spike in traffic. Th

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Joel Koshy
On Fri, Oct 25, 2013 at 1:18 AM, Jason Rosenberg wrote: > I'm running into an issue where sometimes the controlled shutdown fails to > complete after the default 3 retry attempts. This ended up in one case, > with a broker undergoing an unclean shutdown, and then it was in a rather > bad state