Kafka controlled shutdown hangs when there are large number of topics in the cluster

2016-12-19 Thread Robin, Martin (Nokia - IN/Bangalore)
While we do this, Kafka controlled shutdown hangs. This same issue was not seen with 25 topics. Please let us know if any solution is known to this issue. Thanks, Martin

What happens if controlled shutdown can't complete within controlled.shutdown.max.retries attempts?

2016-03-20 Thread James Cheng
The broker has the following parameters related to controlled shutdown: controlled.shutdown.enable: Enable controlled shutdown of the server (type: boolean, default: true, importance: medium). controlled.shutdown.max.retries: Controlled shutdown can fail for multiple reasons. This determines the number of

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-25 Thread Luke Steensen
ents.producer.internals.Sender.run(Sender.java:128) > [vault_deploy.jar:na] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66] > DEBUG [2016-01-05 21:49:18,159] [] o.apache.kafka.clients.NetworkClient - > Node 0 disconnected. > > > On Thu, Jan 14, 2016 at 10:47 AM, Gwe

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-14 Thread Luke Steensen
s and state-change logs from the controlled > shutdown attempt? > > In theory, the producer should not really see a disconnect - it should get > NotALeader exception (because leaders are re-assigned before the shutdown) > that will cause it to get the metadata. I am guessing that leade

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-14 Thread Gwen Shapira
Do you happen to have broker-logs and state-change logs from the controlled shutdown attempt? In theory, the producer should not really see a disconnect - it should get NotALeader exception (because leaders are re-assigned before the shutdown) that will cause it to get the metadata. I am guessing

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-13 Thread Luke Steensen
Yes, that was my intention and we have both of those configs turned on. For some reason, however, the controlled shutdown wasn't transferring leadership of all partitions, which caused the issues I described in my initial email. On Wed, Jan 13, 2016 at 12:05 AM, Ján Koščo <3k.stan...@g

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Ján Koščo
t; > > luke.steen...@braintreepayments.com> wrote: > > > > > > > Hello, > > > > > > > > We've run into a bit of a head-scratcher with a new kafka deployment > > and > > > > I'm curious if anyone has any ideas. > &g

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Scott Reynolds
> > > > I was able to work around the issue by waiting 60 seconds between > > shutting > > > down the broker and terminating that instance, as well as raising > > > request.timeout.ms on the producer to 2x our zookeeper timeout. This > > gives > > > the broker a much quicker "connection refused" error instead of the > > > connection timeout and seems to give enough time for normal failure > > > detection and leader election to kick in before requests are timed out. > > > > > > So two questions really: (1) are there any known issues that would > cause > > a > > > controlled shutdown to fail to release leadership of all partitions? > and > > > (2) should the producer be timing out connection attempts more > > proactively? > > > > > > Thanks, > > > Luke > > > > > >

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Luke Steensen
e has any ideas. > > > > A little bit of background: this deployment uses "immutable > infrastructure" > > on AWS, so instead of configuring the host in-place, we stop the broker, > > tear down the instance, and replace it wholesale. My understanding was > t

Re: Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Scott Reynolds
anyone has any ideas. > > A little bit of background: this deployment uses "immutable infrastructure" > on AWS, so instead of configuring the host in-place, we stop the broker, > tear down the instance, and replace it wholesale. My understanding was that > controlled shu

Controlled shutdown not relinquishing leadership of all partitions

2016-01-12 Thread Luke Steensen
wn the instance, and replace it wholesale. My understanding was that controlled shutdown combined with producer retries would allow this operation to be zero-downtime. Unfortunately, things aren't working quite as I expected. After poring over the logs, I pieced together the following ch

Re: Controlled shutdown

2015-11-05 Thread Vadim Bobrov
Thanks, Prabhjot I know that running out of space on disks can cause a Kafka shutdown but it is not the case here, there is a lot of free space On Thu, Nov 5, 2015 at 6:08 AM, Prabhjot Bharaj wrote: > Hi Vadim, > > Did you see your hard disk partition getting full where kafka data > directory i

Re: Controlled shutdown

2015-11-05 Thread Vadim Bobrov
Hi Gleb, No, no ZooKeeper-related errors. The only suspicious line I see immediately preceding the shutdown is this: [2015-11-03 01:53:20,810] INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer) which makes me think it could be some networ

Re: Controlled shutdown

2015-11-05 Thread Prabhjot Bharaj
Hi Vadim, Did you see your hard disk partition getting full where kafka data directory is present ? It could be because you have set log retention to a larger value, whereas your input data may be taking up full disk space. In that case, move some data out from that disk partition, set log retenti

Re: Controlled shutdown

2015-11-05 Thread Gleb Zhukov
Hi, Vadim. Do you see something like this: "zookeeper state changed (Expired)" in kafka's logs? On Wed, Nov 4, 2015 at 6:33 PM, Vadim Bobrov wrote: > Hi, > > does anyone know in what cases Kafka will take itself down? I have a > cluster of 2 nodes that went down (not crashed) this night in a con

Controlled shutdown

2015-11-04 Thread Vadim Bobrov
Hi, does anyone know in what cases Kafka will take itself down? I have a cluster of 2 nodes that went down (not crashed) this night in a controlled and orderly shutdown as far as I can tell, except it wasn't controlled by me Thanks Vadim

Re: High delay during controlled shutdown and acks=-1

2015-11-02 Thread Becket Qin
Hi Federico, What is your replica.lag.time.max.ms? When acks=-1, the ProducerResponse won't return until all the brokers in the ISR get the message. During controlled shutdown, the shutting down broker is doing a lot of leader migration and could slow down on fetching data. The broker won't

High delay during controlled shutdown and acks=-1

2015-11-02 Thread Federico Giraud
Hi, I have a few Java async producers sending data to a 4-node Kafka cluster version 0.8.2, containing a few thousand topics, all with replication factor 2. When I use acks=1 and trigger a controlled shutdown + restart on one broker, the producers will send data to the new leader, reporting a very

Re: Controlled Shutdown Tool?

2015-09-11 Thread Otis Gospodnetić
SIGTERM is what I was looking for. The docs are unclear on > that, it would be useful to fix those. Thanks! > > > > On Jul 27, 2015, at 14:59, Binh Nguyen Van wrote: > > > > You can initiate controlled shutdown by running bin/kafka-server-stop.sh. > This > > wil

Re: Controlled Shutdown Tool?

2015-07-27 Thread Andrew Otto
Ah, thank you, SIGTERM is what I was looking for. The docs are unclear on that, it would be useful to fix those. Thanks! > On Jul 27, 2015, at 14:59, Binh Nguyen Van wrote: > > You can initiate controlled shutdown by running bin/kafka-server-stop.sh. This > will send a SIGTERM to br

Re: Controlled Shutdown Tool?

2015-07-27 Thread Binh Nguyen Van
You can initiate controlled shutdown by running bin/kafka-server-stop.sh. This will send a SIGTERM to the broker to tell it to do the controlled shutdown. I also got confused before and had to look at the code to figure that out. I think it is better if we can add this to the document. -Binh On Mon, Jul
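
The point made in this reply (a SIGTERM, not a SIGKILL, is what triggers the controlled shutdown) can be sketched in a few lines of Python. This is only an illustration, assuming the broker's process ID is already known; the function name `request_controlled_shutdown` is hypothetical and not part of Kafka or its scripts.

```python
import os
import signal


def request_controlled_shutdown(broker_pid: int) -> None:
    """Ask a broker process to stop by sending it SIGTERM.

    SIGTERM can be handled by the process, which gives the broker's
    shutdown hook a chance to run. SIGKILL (kill -9) cannot be handled,
    so it would bypass the controlled shutdown entirely.
    """
    # broker_pid is an assumption: the caller must look it up
    # (for example from a pid file or the process table).
    os.kill(broker_pid, signal.SIGTERM)
```

Sending the signal does not wait for the broker to exit; a caller that needs to know when the shutdown has finished would have to watch the process separately.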

Re: Controlled Shutdown Tool?

2015-07-27 Thread Sriharsha Chintalapani
Controlled shutdown is built into the broker. When this config is set to true, the broker sends a request to the controller to initiate the controlled shutdown and waits until the request succeeds; in case of failure, it retries the shutdown up to controlled.shutdown.max.retries times. https://github.com/apache/kafka/blob
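
The retry behaviour described in this reply can be modelled roughly as follows. This is a sketch of the idea, not the broker's actual code: the function name, the `attempt` callback and the `backoff_s` value are all hypothetical, and treating `max_retries` as the total number of attempts is an assumption.

```python
import time
from typing import Callable


def controlled_shutdown_with_retries(
    attempt: Callable[[], bool],
    max_retries: int = 3,           # mirrors the default of 3 mentioned in this thread
    backoff_s: float = 5.0,         # illustrative pause between attempts (assumption)
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `attempt` until it succeeds or the attempts are used up.

    Returns True if one attempt reported success, False otherwise.
    """
    for attempt_no in range(1, max_retries + 1):
        if attempt():
            return True
        # Pause before the next try, but not after the final failure.
        if attempt_no < max_retries:
            sleep(backoff_s)
    return False
```

A False result here corresponds to the situation discussed elsewhere in this list, where the broker gives up on the controlled shutdown after its retries are exhausted.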

Re: Controlled Shutdown Tool?

2015-07-27 Thread Andrew Otto
Thanks! But how do I initiate a controlled shutdown on a running broker? Editing server.properties is not going to cause this to happen. Don’t I have to tell the broker to shutdown nicely? All I really want to do is tell the controller to move leadership to other replicas, so I can shutdown

Re: Controlled Shutdown Tool?

2015-07-27 Thread Sriharsha Chintalapani
You can set controlled.shutdown.enable to true in Kafka’s server.properties. This is enabled by default from 0.8.2 onwards. You can also set the maximum number of retries using controlled.shutdown.max.retries, which defaults to 3. Thanks, Harsha On July 27, 2015 at 11:42:32 AM, Andrew Otto (ao...@wikimedia.org)

Controlled Shutdown Tool?

2015-07-27 Thread Andrew Otto
I’m working on packaging 0.8.2.1 for Wikimedia, and in doing so I’ve noticed that kafka.admin.ShutdownBroker doesn’t exist anymore. From what I can tell, this has been intentionally removed in favor of a JMX(?) config “controlled.shutdown.enable”. It is unclear from the documentation how one i

Re: Interrupting controlled shutdown breaks Kafka cluster

2014-11-10 Thread Solon Gordon
gt; > Solon, > > > > > > Which version of Kafka are you running and are you enabling auto leader > > > rebalance at the same time? > > > > > > Guozhang > > > > > > On Fri, Nov 7, 2014 at 8:41 AM, Solon Gordon > wrote: > > >

Re: Interrupting controlled shutdown breaks Kafka cluster

2014-11-09 Thread Guozhang Wang
at 8:41 AM, Solon Gordon wrote: > > > > > Hi all, > > > > > > My team has observed that if a broker process is killed in the middle > of > > > the controlled shutdown procedure, the remaining brokers start spewing > > > errors and do not prop

Re: Interrupting controlled shutdown breaks Kafka cluster

2014-11-09 Thread Neha Narkhede
We fixed a couple issues related to automatic leader balancing and controlled shutdown. Would you mind trying out 0.8.2-beta? On Fri, Nov 7, 2014 at 11:52 AM, Solon Gordon wrote: > We're using 0.8.1.1 with auto.leader.rebalance.enable=true. > > On Fri, Nov 7, 2014 at 2:35 PM,

Re: Interrupting controlled shutdown breaks Kafka cluster

2014-11-07 Thread Solon Gordon
2014 at 8:41 AM, Solon Gordon wrote: > > > Hi all, > > > > My team has observed that if a broker process is killed in the middle of > > the controlled shutdown procedure, the remaining brokers start spewing > > errors and do not properly rebalance leadership. The

Re: Interrupting controlled shutdown breaks Kafka cluster

2014-11-07 Thread Guozhang Wang
Solon, Which version of Kafka are you running and are you enabling auto leader rebalance at the same time? Guozhang On Fri, Nov 7, 2014 at 8:41 AM, Solon Gordon wrote: > Hi all, > > My team has observed that if a broker process is killed in the middle of > the controlled shutdo

Interrupting controlled shutdown breaks Kafka cluster

2014-11-07 Thread Solon Gordon
Hi all, My team has observed that if a broker process is killed in the middle of the controlled shutdown procedure, the remaining brokers start spewing errors and do not properly rebalance leadership. The cluster cannot recover without major manual intervention. Here is how to reproduce the

Re: How to perform a controlled shutdown for rolling bounce?

2014-08-14 Thread Joel Koshy
12:59:16PM -0700, Ryan Williams wrote: > Thanks for clarifying. > > When I increase the replication factor, enable controlled shutdown and want > to do a controlled shutdown, do I still issue the same shutdown (SIGTERM)? > > > On Thu, Aug 14, 2014 at 11:40

Re: How to perform a controlled shutdown for rolling bounce?

2014-08-14 Thread Ryan Williams
Thanks for clarifying. When I increase the replication factor, enable controlled shutdown and want to do a controlled shutdown, do I still issue the same shutdown (SIGTERM)? On Thu, Aug 14, 2014 at 11:40 AM, Joel Koshy wrote: > Controlled shutdown does not really help in your case since y

Re: How to perform a controlled shutdown for rolling bounce?

2014-08-14 Thread Joel Koshy
Controlled shutdown does not really help in your case since your replication factor is one. > What does the -1 for Leader and blank Isr indicate? Do I need to run It means the partition is unavailable (since there are no other replicas). So you should either use a higher replication factor
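
To make the replication-factor point concrete, here is a small Python sketch that lists the partitions that would have no replica left to lead them if a given broker stopped. The data structure (a dict from `(topic, partition)` to the list of in-sync broker IDs) and the function name are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Partition = Tuple[str, int]  # (topic, partition number)


def partitions_without_other_replicas(
    isr_by_partition: Dict[Partition, List[int]], broker_id: int
) -> List[Partition]:
    """Return partitions whose in-sync replicas consist only of broker_id.

    For these partitions there is no other replica to move leadership to,
    so stopping the broker makes them unavailable regardless of how the
    shutdown is performed.
    """
    return sorted(
        tp
        for tp, isr in isr_by_partition.items()
        if broker_id in isr and all(b == broker_id for b in isr)
    )
```

With a replication factor of one, every partition hosted on the broker falls into this list, which is why controlled shutdown cannot help in that case.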

How to perform a controlled shutdown for rolling bounce?

2014-08-14 Thread Ryan Williams
Running 0.8.1 and am unable to do a controlled shutdown as part of a rolling bounce. Is this the primary reference for this task? https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-1.ControlledShutdown I've set the config to enable controlled shu

Re: Controlled shutdown and leader election issues

2014-04-07 Thread Ryan Berdeen
rberd...@hubspot.com> > > > > > wrote: > > > > > > > > > > > So, for 0.8 without "controlled.shutdown.enable", why would > > > > > ShutdownBroker > > > > > > and restarting cause

Re: Controlled shutdown and leader election issues

2014-04-03 Thread Clark Breyman
acefully? > > > > > > > > > > What's up with 0.8.1 getting stuck in preferred leader election? > > > > > > > > > > > > > > > On Fri, Mar 21, 2014 at 12:18 AM, Neha Narkhede < > > > neha.narkh...@g

Re: Controlled shutdown and leader election issues

2014-04-03 Thread Neha Narkhede
> > > > What's up with 0.8.1 getting stuck in preferred leader election? > > > > > > > > > > > > On Fri, Mar 21, 2014 at 12:18 AM, Neha Narkhede < > > neha.narkh...@gmail.com > > > > >wrote: > > > > > > >

Re: Controlled shutdown and leader election issues

2014-04-03 Thread Clark Breyman
ede < > neha.narkh...@gmail.com > > > >wrote: > > > > > > > Which brings up the question - Do we need ShutdownBroker anymore? It > > > seems > > > > like the config should handle controlled shutdown correctly anyway. > > > > >

Re: Controlled shutdown and leader election issues

2014-04-02 Thread Neha Narkhede
arkhede > >wrote: > > > > > Which brings up the question - Do we need ShutdownBroker anymore? It > > seems > > > like the config should handle controlled shutdown correctly anyway. > > > > > > Thanks, > > > Neha > > > > >

Re: Controlled shutdown and leader election issues

2014-04-02 Thread Clark Breyman
we need ShutdownBroker anymore? It > seems > > like the config should handle controlled shutdown correctly anyway. > > > > Thanks, > > Neha > > > > > > On Thu, Mar 20, 2014 at 9:16 PM, Jun Rao wrote: > > > > > We haven't been testin

Re: Controlled shutdown and leader election issues

2014-03-21 Thread Ryan Berdeen
Narkhede wrote: > Which brings up the question - Do we need ShutdownBroker anymore? It seems > like the config should handle controlled shutdown correctly anyway. > > Thanks, > Neha > > > On Thu, Mar 20, 2014 at 9:16 PM, Jun Rao wrote: > > > We haven't been t

Re: Controlled shutdown and leader election issues

2014-03-20 Thread Neha Narkhede
Which brings up the question - Do we need ShutdownBroker anymore? It seems like the config should handle controlled shutdown correctly anyway. Thanks, Neha On Thu, Mar 20, 2014 at 9:16 PM, Jun Rao wrote: > We haven't been testing the ShutdownBroker command in 0.8.1 rigorously > sin

Re: Controlled shutdown and leader election issues

2014-03-20 Thread Jun Rao
We haven't been testing the ShutdownBroker command in 0.8.1 rigorously since in 0.8.1, one can do the controlled shutdown through the new config "controlled.shutdown.enable". Instead of running the ShutdownBroker command during the upgrade, you can also wait until under replicated

Controlled shutdown and leader election issues

2014-03-20 Thread Ryan Berdeen
While upgrading from 0.8.0 to 0.8.1 in place, I observed some surprising behavior using kafka.admin.ShutdownBroker. At the start, there were no underreplicated partitions. After running bin/kafka-run-class.sh kafka.admin.ShutdownBroker --broker 10 ... Partitions that had replicas on broker 10 w

Re: Controlled shutdown and "UnderReplicatedPartitions" state?

2013-12-02 Thread Guozhang Wang
l#brokerconfigs On Mon, Dec 2, 2013 at 4:23 PM, Nitzan Harel wrote: > The default value for "controlled.shutdown.enable" is false. > Does that mean that stopping a broker without a controlled shutdown and > using a "kill -9" might lead to an "UnderReplicatedPartitions" state? > -- -- Guozhang

Controlled shutdown and "UnderReplicatedPartitions" state?

2013-12-02 Thread Nitzan Harel
The default value for "controlled.shutdown.enable" is false. Does that mean that stopping a broker without a controlled shutdown and using a "kill -9" might lead to an "UnderReplicatedPartitions" state?

Re: Controlled shutdown failure, retry settings

2013-11-03 Thread Jun Rao
A replica is dropped out of ISR if (1) it hasn't issued a fetch request for some time, or (2) it's behind the leader by some messages. The replica will be added back to ISR if neither condition is true any longer. The actual value depends on the application. For example, if there is a spike and the fol
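
The two conditions above can be written as a simple predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the two thresholds stand in for the replica.lag.time.max.ms and replica.lag.max.messages settings named elsewhere in this thread, and the use of strict "greater than" comparisons is a guess rather than something the thread confirms.

```python
def should_stay_in_isr(
    ms_since_last_fetch: int,
    messages_behind: int,
    lag_time_max_ms: int,
    lag_max_messages: int,
) -> bool:
    """Return True if a follower satisfies neither drop-out condition."""
    # Condition (1): no fetch request for too long.
    timed_out = ms_since_last_fetch > lag_time_max_ms
    # Condition (2): too many messages behind the leader.
    too_far_behind = messages_behind > lag_max_messages
    return not (timed_out or too_far_behind)
```

A follower that fails either check is removed from the ISR and is added back once both checks pass again.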

Re: Controlled shutdown failure, retry settings

2013-11-03 Thread Jason Rosenberg
Jun, Can you explain the difference between "failed" and "slow"? In either case, the follower drops out of the ISR, and can come back later if they catch up, no? In the configuration doc, it seems to describe them both with the same language: "if ., the leader will remove the follower from

Re: Controlled shutdown failure, retry settings

2013-11-02 Thread Jun Rao
replica.lag.time.max.ms is used to detect a failed broker. replica.lag.max.messages is used to detect a slow broker. Thanks, Jun On Fri, Nov 1, 2013 at 10:36 PM, Jason Rosenberg wrote: > In response to Joel's point, I think I do understand that messages can be > lost, if in fact we have dropp

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Jason Rosenberg
In response to Joel's point, I think I do understand that messages can be lost, if in fact we have dropped down to only 1 member in the ISR at the time the message is written, and then that 1 node goes down. What I'm not clear on, is the conditions under which a node can drop out of the ISR. You

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Neha Narkhede
For supporting more durability at the expense of availability, we have a JIRA that we will fix on trunk. This will allow you to configure the default as well as per topic durability vs availability behavior - https://issues.apache.org/jira/browse/KAFKA-1028 Thanks, Neha On Fri, Nov 1, 2013 at 1

Re: Controlled shutdown failure, retry settings

2013-11-01 Thread Joel Koshy
Unclean shutdown could result in data loss - since you are moving leadership to a replica that has fallen out of ISR, i.e., its log end offset is behind the last committed message to this partition. >>> But if data is written with 'request.required.acks=-1', no data s

Re: Controlled shutdown failure, retry settings

2013-10-29 Thread Jason Rosenberg
I've filed: https://issues.apache.org/jira/browse/KAFKA-1108 On Tue, Oct 29, 2013 at 4:29 PM, Jason Rosenberg wrote: > Here's another exception I see during controlled shutdown (this time there > was not an unclean shutdown problem). Should I be concerned about this > exc

Re: Controlled shutdown failure, retry settings

2013-10-29 Thread Jason Rosenberg
Here's another exception I see during controlled shutdown (this time there was not an unclean shutdown problem). Should I be concerned about this exception? Is any data loss possible with this? This one happened after the first "Retrying controlled shutdown after the previous atte

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
On Fri, Oct 25, 2013 at 9:16 PM, Joel Koshy wrote: > > Unclean shutdown could result in data loss - since you are moving > leadership to a replica that has fallen out of ISR, i.e., its log end > offset is behind the last committed message to this partition. > > But if data is written with 'reque

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Joel Koshy
On Fri, Oct 25, 2013 at 3:22 PM, Jason Rosenberg wrote: > It looks like when the controlled shutdown fails with an IOException, the > exception is swallowed, and we see nothing in the logs: > > catch { > case ioe: java

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
It looks like when the controlled shutdown fails with an IOException, the exception is swallowed, and we see nothing in the logs: catch { case ioe: java.io.IOException => channel.disconnect() channel = null // ignore

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
, > Neha > > > On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg wrote: > > > Ok, > > > > Looking at the controlled shutdown code, it appears that it can fail with > > an IOException too, in which case it won't report the remaining > partitions > &

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Neha Narkhede
On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg wrote: > Ok, > > Looking at the controlled shutdown code, it appears that it can fail with > an IOException too, in which case it won't report the remaining partitions > to replicate, etc. (I think that might be what I'

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
Ok, Looking at the controlled shutdown code, it appears that it can fail with an IOException too, in which case it won't report the remaining partitions to replicate, etc. (I think that might be what I'm seeing, since I never saw the log line for "controlled shutdown fail

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Neha Narkhede
Controlled shutdown can fail if the cluster has a non-zero under-replicated partition count, since that means the leaders may not move off of the broker being shut down, causing controlled shutdown to fail. The backoff might help if the under replication is just temporary due to a spike in traffic

Re: Controlled shutdown failure, retry settings

2013-10-25 Thread Joel Koshy
On Fri, Oct 25, 2013 at 1:18 AM, Jason Rosenberg wrote: > I'm running into an issue where sometimes, the controlled shutdown fails to > complete after the default 3 retry attempts. This ended up in one case, > with a broker undergoing an unclean shutdown, and then it was in a rath

Controlled shutdown failure, retry settings

2013-10-25 Thread Jason Rosenberg
I'm running into an issue where sometimes, the controlled shutdown fails to complete after the default 3 retry attempts. This ended up in one case, with a broker undergoing an unclean shutdown, and then it was in a rather bad state after restart. Producers would connect to the metadat

Re: controlled shutdown

2013-08-14 Thread Joel Koshy
You can send it a SIGTERM signal (SIGKILL won't work). Thanks, Joel On Wed, Aug 14, 2013 at 8:05 AM, Yu, Libo wrote: > Hi, > > In this link > https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools > two ways to do controlled shutdown are mentioned. "

controlled shutdown

2013-08-14 Thread Yu, Libo
Hi, In this link https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools two ways to do controlled shutdown are mentioned. "The first approach is to set "controlled.shutdown.enable" to true in the broker. Then, the broker will try to move all leaders on it to other brok

Re: question about Controlled Shutdown

2013-06-19 Thread Jason Rosenberg
Nice! Thanks, Jason On Wed, Jun 19, 2013 at 9:16 PM, Jun Rao wrote: > Actually, we recently added the option to enable controlled shutdown in the > broker shutdown hook. I have updated our wiki accordingly ( > https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools).

Re: question about Controlled Shutdown

2013-06-19 Thread Jun Rao
Actually, we recently added the option to enable controlled shutdown in the broker shutdown hook. I have updated our wiki accordingly ( https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools). Thanks, Jun On Wed, Jun 19, 2013 at 6:32 PM, Jason Rosenberg wrote: > Was just read

question about Controlled Shutdown

2013-06-19 Thread Jason Rosenberg
Was just reading about Controlled Shutdown here: https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools Is this something that can be invoked from code, from within a container running the KafkaServer? Currently I launch kafka.server.KafkaServer directly from our java app container