You could have had a split-brain situation where multiple brokers thought they were the controller. This has been a problematic piece of Kafka for a long time, and only in the last release or two have some of the edge cases been cleaned up.

To force a controller re-election, remove the "/controller" znode from ZooKeeper, then tail the logs on all brokers and you'll see when the controller is stable again. You can do this as a zero-downtime operation.
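
For illustration, the whole procedure from the command line might look roughly like this; a sketch, assuming ZooKeeper is reachable at localhost:2181, that you run it from the Kafka installation directory, and that broker logs live under logs/ (all of which are environment-specific):

    # See which broker currently holds the controller role (the znode holds JSON with a "brokerid" field)
    bin/zookeeper-shell.sh localhost:2181 get /controller

    # Delete the ephemeral znode; the brokers then elect a new controller on their own
    bin/zookeeper-shell.sh localhost:2181 delete /controller

    # Watch the broker logs until the election settles (adjust the paths to your deployment)
    tail -f logs/controller.log logs/server.log

Deleting the znode just triggers a fresh election; the brokers recreate it themselves, which is why this can be done without taking anything down.
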
On Fri, Mar 23, 2018, 10:35 AM Swapnil Gupta <neomatrix1...@gmail.com> wrote:
> As far as I have debugged my issues, the server.log files were the most useful; they contain the errors and the 'caused by' reasons.
>
> Probably the leader broker was down and the followers were no longer able to replicate data.
> 1. What are your min ISR and replication settings?
> 2. Is your producer working?
> 3. What are your consumer/producer ack settings?
> 4. Are you using log compaction, by any chance?
>
> If the consumer is not able to fetch, check whether those partitions are offline. If not, the followers might not have caught up with the leader while the leader itself is unavailable; you might have to 'reassign partitions' or use the leader election tool to shift them to another node.
> I have faced this issue, but it was due to log compaction and too many open files, which was taking my nodes down one by one.
>
> I use Prometheus/Grafana to monitor and alert on all these metrics: offline partitions, topic and broker states. If you have it set up for production, is it showing anything weird?
> Try bringing up the broker that went down, check whether it is reassigning or not, check for some stability, and monitor what the logs are showing. The leader should be available once the node is up. If not, bring up another node with the same broker id and try running leader election again (I use Kafka Manager for those tools, it makes this easy); this will rebalance the partitions onto the new node and should balance your partitions.
>
> Just mentioning some things I did when my nodes were going down. You can analyse them and see if any of this is useful.
>
> Regards,
> Swapnil
>
> On Fri, Mar 23, 2018 at 9:48 PM, Ryan O'Rourke <rorou...@marchex.com> wrote:
>
> > Thanks for the tips.
> >
> > As far as a broker being down, the situation is confusing because a broker was definitely down, but I think we brought it down ourselves when we were trying to address the problem a few days ago. I can't say for sure that it wasn't down before, but we do have a monitor to catch that (by trying to telnet to the broker) and it was not firing on the 14th when things went sideways.
> >
> > I looked in the server.log files and found many instances of this exception at the time things went bad:
> >
> > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
> >
> > Also some like "error when handling request Name:UpdateMetadataRequest".
> >
> > There are a few different sorts of errors across the server.log files, but they all seem to be varieties of this one. We have 5 nodes - is there one in particular where I should be checking the logs for errors, or any particular error to look for that would be a likely root cause?
> >
> > -----Original Message-----
> > From: Swapnil Gupta [mailto:neomatrix1...@gmail.com]
> > Sent: Friday, March 23, 2018 8:00 AM
> > To: users@kafka.apache.org
> > Subject: Re: can't consume from partitions due to KAFKA-3963
> >
> > Maybe a broker is down or unreachable, which may be breaking your min ISR ratio; when the producer is set to acks=all, the min ISR has to be satisfied.
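
As a quick way to act on that advice, the ISR and availability state of the topic can be checked from the command line; a sketch, assuming a ZooKeeper-based cluster at localhost:2181 and using the topic name from later in this thread (adjust to your environment):

    # Partitions whose ISR has shrunk below the full replica set
    bin/kafka-topics.sh --zookeeper localhost:2181 --describe --under-replicated-partitions

    # Partitions that currently have no available leader at all
    bin/kafka-topics.sh --zookeeper localhost:2181 --describe --unavailable-partitions

    # Leader, replicas, ISR and any min.insync.replicas override for one topic
    bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic transcription-results

Partitions reported as unavailable (Leader: -1) would match consumers being unable to fetch from them.
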
> > Check your broker connectivity, or bring up a fresh broker and use the preferred replica leader election tool -> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools to rebalance the existing data onto the new broker.
> >
> > Check which node is down, check ISR and under-replication, and then you have the node id that may be causing the trouble.
> > And better, check the *server.log* files and grep for the error and the 'caused by' there; that will give you the exact reason.
> >
> > Regards,
> > Swap
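
To make those two suggestions concrete, here is a sketch of what they might look like on the command line (assuming ZooKeeper at localhost:2181 and broker logs under logs/; both are environment-specific, and partitions.json is a hypothetical file name):

    # Grep a broker's server.log for errors and their root causes
    grep -n -A3 -E "ERROR|Caused by" logs/server.log

    # Trigger preferred replica leader election for all partitions...
    bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181

    # ...or only for specific partitions, using a JSON file such as:
    #   {"partitions": [{"topic": "transcription-results", "partition": 9}]}
    bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181 --path-to-json-file partitions.json

Note that preferred replica election only helps if the preferred replica is actually alive and in the ISR; it will not bring an offline partition back by itself.
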
> > On Fri, Mar 23, 2018 at 1:50 AM, Ryan O'Rourke <rorou...@marchex.com> wrote:
> >
> > > Hi, we're having an outage in our production Kafka and getting desperate; any help would be appreciated.
> > >
> > > On 3/14 our consumer (a Storm spout) started getting messages from only 20 out of 40 partitions on a topic. We only noticed yesterday. Restarting the consumer with a new consumer group does not fix the problem.
> > >
> > > We just found some errors in the Kafka state change log which look like they may be related - the example is definitely one of the affected partitions, and the timestamp lines up with when the problem started. Seems to be related to KAFKA-3963. What can we do to mitigate this and prevent it from happening again?
> > >
> > > kafka.common.NoReplicaOnlineException: No replica for partition [transcription-results,9] is alive. Live brokers are: [Set()], Assigned replicas are: [List(1, 4, 0)]
> > > [2018-03-14 03:11:40,863] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OnlineReplica to OfflineReplica (state.change.logger)
> > > [2018-03-14 03:11:41,141] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,145] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 0 for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,208] TRACE Controller 0 epoch 44 changed state of replica 4 for partition [transcription-results,9] from OnlineReplica to OnlineReplica (state.change.logger)
> > > [2018-03-14 03:11:41,218] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OfflineReplica to OnlineReplica (state.change.logger)
> > > [2018-03-14 03:11:41,226] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,230] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 1 for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,450] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,454] TRACE Broker 0 handling LeaderAndIsr request correlationId 158 from controller 0 epoch 44 starting the become-follower transition for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,455] ERROR Broker 0 received LeaderAndIsrRequest with correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
> > > [2018-03-14 03:11:41,459] TRACE Broker 0 completed LeaderAndIsr request correlationId 158 from controller 0 epoch 44 for the become-follower transition for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,682] TRACE Controller 0 epoch 44 started leader election for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,687] TRACE Controller 0 epoch 44 elected leader 4 for Offline partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,689] TRACE Controller 0 epoch 44 changed partition [transcription-results,9] from OfflinePartition to OnlinePartition with leader 4 (state.change.logger)
> > > [2018-03-14 03:11:41,825] TRACE Controller 0 epoch 44 sending become-leader LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:443,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,826] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:4,ISR:4,LeaderEpoch:443,ControllerEpoch:44) to broker 1 for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,899] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) for partition [transcription-results,9] in response to UpdateMetadata request sent by controller 1 epoch 47 with correlation id 0 (state.change.logger)
> > > [2018-03-14 03:11:41,906] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 1 from controller 1 epoch 47 for partition [transcription-results,9] (state.change.logger)
> > > [2018-03-14 03:11:41,908] WARN Broker 0 ignoring LeaderAndIsr request from controller 1 with correlation id 1 epoch 47 for partition [transcription-results,9] since its associated leader epoch 441 is old. Current leader epoch is 441 (state.change.logger)
> > > [2018-03-14 03:11:41,982] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:(Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) for partition [transcription-results,9] in response to UpdateMetadata request sent by controller 1 epoch 47 with correlation id 2 (state.change.logger)
> > > [2018-03-22 14:43:36,098] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:-1,ISR:,LeaderEpoch:444,ControllerEpoch:47),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 679 from controller 1 epoch 47 for partition [transcription-results,9] (state.change.logger)
> > >
> >
> > --
> > <https://in.linkedin.com/in/swappy>
>
> --
> <https://in.linkedin.com/in/swappy>
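
For completeness, the 'reassign partitions' route suggested earlier in the thread would look roughly like this; a sketch, assuming ZooKeeper at localhost:2181, target brokers 0, 1 and 4, and hypothetical file names topics.json and reassignment.json:

    # topics.json lists the topics to move, e.g.:
    #   {"version": 1, "topics": [{"topic": "transcription-results"}]}
    bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
      --topics-to-move-json-file topics.json --broker-list "0,1,4" --generate

    # Save the proposed assignment it prints to reassignment.json, then apply and verify it
    bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
      --reassignment-json-file reassignment.json --execute
    bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
      --reassignment-json-file reassignment.json --verify

Reassignment needs a live leader to copy from, so it complements, rather than replaces, getting the downed broker (or a replacement with the same broker id) back online.
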