Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-25 Thread Luke Steensen
Ok, I've reproduced this again and made sure to grab the broker logs before the instances are terminated. I posted a writeup with what seemed like the relevant bits of the logs here: https://gist.github.com/lukesteensen/793a467a058af51a7047 To summarize, it looks like Gwen was correct and the broke

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-14 Thread Luke Steensen
I don't have broker logs at the moment, but I'll work on getting some I can share. We are running 0.9.0.0 for both the brokers and producer in this case. I've pasted some bits from the producer log below in case that's helpful. Of particular note is how long it takes for the second disconnect to oc

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-14 Thread Gwen Shapira
Do you happen to have broker-logs and state-change logs from the controlled shutdown attempt? In theory, the producer should not really see a disconnect - it should get NotALeader exception (because leaders are re-assigned before the shutdown) that will cause it to get the metadata. I am guessing

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-13 Thread Luke Steensen
Yes, that was my intention and we have both of those configs turned on. For some reason, however, the controlled shutdown wasn't transferring leadership of all partitions, which caused the issues I described in my initial email. On Wed, Jan 13, 2016 at 12:05 AM, Ján Koščo <3k.stan...@gmail.com> w

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Ján Koščo
Not sure, but should the combination of auto.leader.rebalance.enable=true and controlled.shutdown.enable=true sort this out for you? 2016-01-13 1:13 GMT+01:00 Scott Reynolds : > we use 0.9.0.0 and it is working fine. Not all the features work and a few > things make a few assumptions about how zookee

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Scott Reynolds
we use 0.9.0.0 and it is working fine. Not all the features work and a few things make a few assumptions about how zookeeper is used. But as a tool for provisioning, expanding and failure recovery it is working fine so far. *knocks on wood* On Tue, Jan 12, 2016 at 4:08 PM, Luke Steensen < luke.st

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Luke Steensen
Ah, that's a good idea. Do you know if kafka-manager works with Kafka 0.9 by chance? That would be a nice improvement over the CLI tools. Thanks, Luke On Tue, Jan 12, 2016 at 4:53 PM, Scott Reynolds wrote: > Luke, > > We practice the same immutable pattern on AWS. To decommission a broker, we >

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Scott Reynolds
Luke, We practice the same immutable pattern on AWS. To decommission a broker, we use partition reassignment first to move the partitions off of the node and preferred leadership election. To do this with a web ui, so that you can handle it on lizard brain at 3 am, we have the Yahoo Kafka Manager
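
The reassignment step described in this message can be sketched as follows. This is a minimal illustration, not the posters' tooling: the function name, the `spare_brokers` argument, and the rule for picking a replacement are assumptions. The output follows the JSON layout accepted by kafka-reassign-partitions.sh (`version`, then a `partitions` list of `topic`, `partition`, `replicas`).

```python
import json
from typing import Dict, List, Tuple


def reassignment_without_broker(
    assignment: Dict[Tuple[str, int], List[int]],
    old_broker: int,
    spare_brokers: List[int],
) -> str:
    """Build a reassignment plan that moves every replica off old_broker.

    assignment maps (topic, partition) to its current replica list.
    spare_brokers is a hypothetical list of brokers allowed to take over.
    Raises StopIteration if no spare broker is free for some partition.
    """
    partitions = []
    for (topic, partition), replicas in sorted(assignment.items()):
        if old_broker not in replicas:
            continue  # this partition has no replica on the broker being removed
        # Pick the first spare broker that does not already hold a replica.
        replacement = next(b for b in spare_brokers if b not in replicas)
        # Keep the replica order; the first entry is the preferred leader.
        new_replicas = [replacement if b == old_broker else b for b in replicas]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )
    return json.dumps({"version": 1, "partitions": partitions}, indent=2)
```

Once the reassignment has completed, a preferred leader election moves leadership onto the new first replica of each partition, which is the second step the message mentions.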

Re: Controlled shutdown

2015-11-05 Thread Vadim Bobrov
Thanks, Prabhjot. I know that running out of space on disks can cause a Kafka shutdown, but that is not the case here; there is a lot of free space. On Thu, Nov 5, 2015 at 6:08 AM, Prabhjot Bharaj wrote: > Hi Vadim, > > Did you see your hard disk partition getting full where kafka data > directory i

Re: Controlled shutdown

2015-11-05 Thread Vadim Bobrov
Hi Gleb, No, no ZooKeeper-related errors. The only suspicious line I see immediately preceding the shutdown is this: [2015-11-03 01:53:20,810] INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer) which makes me think it could be some networ

Re: Controlled shutdown

2015-11-05 Thread Prabhjot Bharaj
Hi Vadim, Did you see your hard disk partition getting full where the Kafka data directory is present? It could be because you have set log retention to a larger value, whereas your input data may be taking up full disk space. In that case, move some data out from that disk partition, set log retenti

Re: Controlled shutdown

2015-11-05 Thread Gleb Zhukov
Hi, Vadim. Do you see something like this: "zookeeper state changed (Expired)" in kafka's logs? On Wed, Nov 4, 2015 at 6:33 PM, Vadim Bobrov wrote: > Hi, > > does anyone know in what cases Kafka will take itself down? I have a > cluster of 2 nodes that went down (not crashed) this night in a con

Re: Controlled Shutdown Tool?

2015-09-11 Thread Otis Gospodnetić
Btw. a regular UNIX kill will do the same - SIGTERM - http://linux.die.net/man/1/kill Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Mon, Jul 27, 2015 at 3:57 PM, Andrew Otto wrote: > Ah, thank you, SIGTERM

Re: Controlled Shutdown Tool?

2015-07-27 Thread Andrew Otto
Ah, thank you, SIGTERM is what I was looking for. The docs are unclear on that, it would be useful to fix those. Thanks! > On Jul 27, 2015, at 14:59, Binh Nguyen Van wrote: > > You can initiate controlled shutdown by run bin/kafka-server-stop.sh. This > will send a SIGTERM to broker to tell

Re: Controlled Shutdown Tool?

2015-07-27 Thread Binh Nguyen Van
You can initiate controlled shutdown by running bin/kafka-server-stop.sh. This will send a SIGTERM to the broker to tell it to do the controlled shutdown. I also got confused before and had to look at the code to figure that out. I think it is better if we can add this to the document. -Binh On Mon, Jul
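
A minimal sketch of what sending that signal looks like from a script, assuming a POSIX system and that the broker's process ID is already known (the function name and the `broker_pid` parameter are illustrative, not part of Kafka):

```python
import os
import signal


def request_controlled_shutdown(broker_pid: int) -> None:
    """Send SIGTERM so the process gets a chance to run its shutdown logic.

    SIGKILL cannot be caught by the target process, so it would stop the
    broker immediately without any chance to hand off partition leadership.
    """
    os.kill(broker_pid, signal.SIGTERM)
```

This is the same signal a plain `kill <pid>` sends by default.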

Re: Controlled Shutdown Tool?

2015-07-27 Thread Sriharsha Chintalapani
Controlled shutdown is built into the broker. When this config is set to true, the broker makes a request to the controller to initiate the controlled shutdown, waits until the request succeeds, and in case of failure retries the shutdown controlled.shutdown.max.retries times. https://github.com/apache/kafka/blob/0.8.
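
The request-and-retry behaviour described in this message can be sketched roughly as below. This is a simplified model, not the broker's actual code: `request_shutdown` is a hypothetical callable standing in for the broker-to-controller request, `max_retries` is treated as the total number of attempts, and the backoff value is an arbitrary placeholder.

```python
import time
from typing import Callable


def controlled_shutdown(
    request_shutdown: Callable[[], bool],
    max_retries: int = 3,
    backoff_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Ask the controller to move leadership away, retrying on failure.

    request_shutdown (hypothetical) returns True when the controller
    reports that the controlled shutdown request succeeded.
    """
    for attempt in range(1, max_retries + 1):
        if request_shutdown():
            return True
        if attempt < max_retries:
            # Wait before the next attempt so the cluster can recover.
            sleep(backoff_seconds)
    # Every attempt failed; the caller has to decide what to do next.
    return False
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.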

Re: Controlled Shutdown Tool?

2015-07-27 Thread Andrew Otto
Thanks! But how do I initiate a controlled shutdown on a running broker? Editing server.properties is not going to cause this to happen. Don’t I have to tell the broker to shut down nicely? All I really want to do is tell the controller to move leadership to other replicas, so I can shut down

Re: Controlled Shutdown Tool?

2015-07-27 Thread Sriharsha Chintalapani
You can set controlled.shutdown.enable to true in Kafka’s server.properties; this is enabled by default from 0.8.2 onwards. You can also set the maximum number of retries using controlled.shutdown.max.retries, which defaults to 3. Thanks, Harsha On July 27, 2015 at 11:42:32 AM, Andrew Otto (ao...@wikimedia.org)
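
As a rough way to check what a given server.properties would yield for these two settings, the sketch below parses simple `key=value` lines and applies the defaults stated in this message (enabled, 3 retries). The helper names are illustrative, and the parser handles only the basic properties syntax.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def controlled_shutdown_settings(text: str) -> dict:
    """Return the effective controlled-shutdown settings.

    Defaults follow the message above: enabled by default from 0.8.2
    onwards, with 3 retries.
    """
    props = parse_properties(text)
    enabled = props.get("controlled.shutdown.enable", "true").lower() == "true"
    max_retries = int(props.get("controlled.shutdown.max.retries", "3"))
    return {"enabled": enabled, "max_retries": max_retries}
```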

Re: Controlled shutdown and leader election issues

2014-04-07 Thread Ryan Berdeen
I think I've figured it out, and it still happens in the 0.8.1 branch. The code that is responsible for deleting the key from ZooKeeper is broken and will never be called when using the command line tool, so it will fail after the first use. I've created https://issues.apache.org/jira/browse/KAFKA

Re: Controlled shutdown and leader election issues

2014-04-03 Thread Clark Breyman
Done. https://issues.apache.org/jira/browse/KAFKA-1360 On Thu, Apr 3, 2014 at 9:13 PM, Neha Narkhede wrote: > >> Is there a maven repo for pulling snapshot CI builds from? > > We still need to get the CI build setup going, could you please file a JIRA > for this? > Meanwhile, you will have to ju

Re: Controlled shutdown and leader election issues

2014-04-03 Thread Neha Narkhede
>> Is there a maven repo for pulling snapshot CI builds from? We still need to get the CI build setup going, could you please file a JIRA for this? Meanwhile, you will have to just build the code yourself for now, unfortunately. Thanks, Neha On Thu, Apr 3, 2014 at 12:01 PM, Clark Breyman wrote

Re: Controlled shutdown and leader election issues

2014-04-03 Thread Clark Breyman
Thanks Neha - Is there a maven repo for pulling snapshot CI builds from? Sorry if this is answered elsewhere. On Wed, Apr 2, 2014 at 7:16 PM, Neha Narkhede wrote: > I'm not so sure if I know the issue you are running into but we fixed a few > bugs with similar symptoms and the fixes are on the 0.

Re: Controlled shutdown and leader election issues

2014-04-02 Thread Neha Narkhede
I'm not so sure if I know the issue you are running into but we fixed a few bugs with similar symptoms and the fixes are on the 0.8.1 branch. It will be great if you give it a try to see if your issue is resolved. Thanks, Neha On Wed, Apr 2, 2014 at 12:59 PM, Clark Breyman wrote: > Was there a

Re: Controlled shutdown and leader election issues

2014-04-02 Thread Clark Breyman
Was there an answer for 0.8.1 getting stuck in preferred leader election? I'm seeing this as well. Is there a JIRA ticket on this issue? On Fri, Mar 21, 2014 at 1:15 PM, Ryan Berdeen wrote: > So, for 0.8 without "controlled.shutdown.enable", why would ShutdownBroker > and restarting cause under

Re: Controlled shutdown and leader election issues

2014-03-21 Thread Ryan Berdeen
So, for 0.8 without "controlled.shutdown.enable", why would ShutdownBroker and restarting cause under-replication and producer exceptions? How can I upgrade gracefully? What's up with 0.8.1 getting stuck in preferred leader election? On Fri, Mar 21, 2014 at 12:18 AM, Neha Narkhede wrote: > Whic

Re: Controlled shutdown and leader election issues

2014-03-20 Thread Neha Narkhede
Which brings up the question - Do we need ShutdownBroker anymore? It seems like the config should handle controlled shutdown correctly anyway. Thanks, Neha On Thu, Mar 20, 2014 at 9:16 PM, Jun Rao wrote: > We haven't been testing the ShutdownBroker command in 0.8.1 rigorously > since in 0.8.1,

Re: Controlled shutdown and leader election issues

2014-03-20 Thread Jun Rao
We haven't been testing the ShutdownBroker command in 0.8.1 rigorously since in 0.8.1, one can do the controlled shutdown through the new config "controlled.shutdown.enable". Instead of running the ShutdownBroker command during the upgrade, you can also wait until under replicated partition count d

Re: Controlled shutdown and "UnderReplicatedPartitions" state?

2013-12-02 Thread Guozhang Wang
Using "controlled shut down" means Kafka will first try to migrate the partition leaderships away from the broker being shut down before actually shutting it down, so that the partitions will not be unavailable. Disabling it would mean that during the time when the broker is down until the controller noticed i

Re: Controlled shutdown failure, retry settings

2013-11-03 Thread Jun Rao
A replica is dropped out of ISR if (1) it hasn't issued a fetch request for some time, or (2) it's behind the leader by some messages. The replica will be added back to ISR once neither condition is true any longer. The actual value depends on the application. For example, if there is a spike and the fol

Re: Controlled shutdown failure, retry settings

2013-11-03 Thread Jason Rosenberg
Jun, Can you explain the difference between "failed" and "slow"? In either case, the follower drops out of the ISR, and can come back later if they catch up, no? In the configuration doc, it seems to describe them both with the same language: "if ., the leader will remove the follower from

Re: Controlled shutdown failure, retry settings

2013-11-02 Thread Jun Rao
replica.lag.time.max.ms is used to detect a failed broker. replica.lag.max.messages is used to detect a slow broker. Thanks, Jun On Fri, Nov 1, 2013 at 10:36 PM, Jason Rosenberg wrote: > In response to Joel's point, I think I do understand that messages can be > lost, if in fact we have dropp
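
The two checks discussed in this thread can be written as a small predicate. This is a sketch of the rule as stated in these messages, not the broker's implementation; the parameter names mirror the two config keys, and both thresholds must be supplied by the caller.

```python
def is_in_sync(
    ms_since_last_fetch: int,
    messages_behind: int,
    lag_time_max_ms: int,
    lag_max_messages: int,
) -> bool:
    """Return True if a follower would stay in the ISR under both checks."""
    # "Failed" follower: no fetch request within replica.lag.time.max.ms.
    failed = ms_since_last_fetch > lag_time_max_ms
    # "Slow" follower: more than replica.lag.max.messages behind the leader.
    slow = messages_behind > lag_max_messages
    return not (failed or slow)
```

In either case the follower leaves the ISR and can rejoin once neither condition holds; the two settings differ only in which symptom they detect.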

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Jason Rosenberg
In response to Joel's point, I think I do understand that messages can be lost, if in fact we have dropped down to only 1 member in the ISR at the time the message is written, and then that 1 node goes down. What I'm not clear on, is the conditions under which a node can drop out of the ISR. You

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Neha Narkhede
For supporting more durability at the expense of availability, we have a JIRA that we will fix on trunk. This will allow you to configure the default as well as per topic durability vs availability behavior - https://issues.apache.org/jira/browse/KAFKA-1028 Thanks, Neha On Fri, Nov 1, 2013 at 1

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Joel Koshy
Unclean shutdown could result in data loss - since you are moving leadership to a replica that has fallen out of ISR. i.e., its log end offset is behind the last committed message to this partition. >>> But if data is written with 'request.required.acks=-1', no data s

Re: Controlled shutdown failure, retry settings

2013-10-29 Thread Jason Rosenberg
I've filed: https://issues.apache.org/jira/browse/KAFKA-1108 On Tue, Oct 29, 2013 at 4:29 PM, Jason Rosenberg wrote: > Here's another exception I see during controlled shutdown (this time there > was not an unclean shutdown problem). Should I be concerned about this > exception? Is any data lo

Re: Controlled shutdown failure, retry settings

2013-10-29 Thread Jason Rosenberg
Here's another exception I see during controlled shutdown (this time there was not an unclean shutdown problem). Should I be concerned about this exception? Is any data loss possible with this? This one happened after the first "Retrying controlled shutdown after the previous attempt failed..." m

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
On Fri, Oct 25, 2013 at 9:16 PM, Joel Koshy wrote: > > Unclean shutdown could result in data loss - since you are moving > leadership to a replica that has fallen out of ISR. i.e., its log end > offset is behind the last committed message to this partition. > > But if data is written with 'reque

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Joel Koshy
On Fri, Oct 25, 2013 at 3:22 PM, Jason Rosenberg wrote: > It looks like when the controlled shutdown fails with an IOException, the > exception is swallowed, and we see nothing in the logs: > > catch { > case ioe: java.io.IOException => > channel.disconne

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
It looks like when the controlled shutdown fails with an IOException, the exception is swallowed, and we see nothing in the logs: catch { case ioe: java.io.IOException => channel.disconnect() channel = null // ignore and tr

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
Neha, It looks like the StateChangeLogMergerTool takes state change logs as input. I'm not sure I know where those live? (Maybe update the doc on that wiki page to describe!). Thanks, Jason On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede wrote: > Jason, > > The state change log tool is desc

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Neha Narkhede
Jason, The state change log tool is described here - https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool I'm curious what the IOException is and if we can improve error reporting. Can you send around the stack trace? Thanks, Neha On

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
Ok, Looking at the controlled shutdown code, it appears that it can fail with an IOException too, in which case it won't report the remaining partitions to replicate, etc. (I think that might be what I'm seeing, since I never saw the log line for "controlled shutdown failed, X remaining partition

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Neha Narkhede
Controlled shutdown can fail if the cluster has a non-zero under-replicated partition count, since that means the leaders may not move off of the broker being shut down, causing controlled shutdown to fail. The backoff might help if the under replication is just temporary due to a spike in traffic. Th
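
To illustrate why under-replication blocks the hand-off, here is a simplified model (not the controller's code): leadership of a partition can only move to another replica in its ISR, so a partition whose ISR contains only the departing broker is left behind. The function name and data shapes are assumptions for the example.

```python
from typing import Dict, List, Tuple


def plan_leader_moves(
    leaders: Dict[str, int],
    isr: Dict[str, List[int]],
    broker_id: int,
) -> Tuple[Dict[str, int], List[str]]:
    """Plan leadership moves away from broker_id.

    leaders maps partition name to current leader; isr maps partition
    name to its in-sync replica list. Returns (moves, remaining), where
    remaining lists partitions that have no other in-sync replica.
    """
    moves: Dict[str, int] = {}
    remaining: List[str] = []
    for partition, leader in leaders.items():
        if leader != broker_id:
            continue  # not led by the broker being shut down
        candidates = [b for b in isr.get(partition, []) if b != broker_id]
        if candidates:
            moves[partition] = candidates[0]
        else:
            # Under-replicated: nobody in sync can take over.
            remaining.append(partition)
    return moves, remaining
```

A non-empty `remaining` list corresponds to the failed attempt described above; retrying after a backoff only helps if the missing followers rejoin the ISR in the meantime.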

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Joel Koshy
On Fri, Oct 25, 2013 at 1:18 AM, Jason Rosenberg wrote: > I'm running into an issue where sometimes, the controlled shutdown fails to > complete after the default 3 retry attempts. This ended up in one case, > with a broker undergoing an unclean shutdown, and then it was in a rather > bad state

Re: controlled shutdown

2013-08-14 Thread Joel Koshy
You can send it a SIGTERM signal (SIGKILL won't work). Thanks, Joel On Wed, Aug 14, 2013 at 8:05 AM, Yu, Libo wrote: > Hi, > > In this link > https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools > two ways to do controlled shutdown are mentioned. "The first approach is to > set