1) any suggestion on how to identify the bad broker(s)? ---> At LinkedIn we have alerts, set up using our internal scripts, that detect when a broker has gone bad. We also check the under-replicated partitions, which can tell us which broker has gone bad. "Going bad" can mean different things: the broker may be alive but unresponsive and completely isolated, or it may have gone down entirely. Can you tell us what you meant by your broker going bad?
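As a rough sketch of the under-replicated-partitions check described above: the idea is to compare each partition's ISR against its full replica set and flag partitions where they differ. The sample output format and parsing below are illustrative assumptions, not the exact output of any Kafka tool (the real `kafka-topics.sh` script also has an `--under-replicated-partitions` flag that does this filtering for you):

```python
# Hypothetical sketch: flag partitions whose ISR is smaller than the replica
# set, given describe-style output. The line format here is an assumption.
sample = """\
Topic: clicks  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
Topic: clicks  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3
"""

def under_replicated(describe_output):
    """Return (topic, partition) pairs with fewer in-sync replicas than replicas."""
    bad = []
    for line in describe_output.splitlines():
        if "Partition:" not in line:
            continue
        # Tokens alternate like "Topic:", "clicks", "Partition:", "0", ...
        # so pair each "Key:" token with the value token that follows it.
        tokens = line.split()
        fields = {tokens[i].rstrip(":"): tokens[i + 1]
                  for i in range(len(tokens) - 1)
                  if tokens[i].endswith(":")}
        if len(fields["Isr"].split(",")) < len(fields["Replicas"].split(",")):
            bad.append((fields["Topic"], int(fields["Partition"])))
    return bad

print(under_replicated(sample))  # -> [('clicks', 1)]
```

A broker that appears in many Replicas lists but is missing from the corresponding Isr lists is a good candidate for the bad broker.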
2) why bouncing of the bad broker got the producers recovered automatically ----> This is because when you bounced the broker, leadership for its partitions changed; the producer then sent a TopicMetadataRequest, which tells it who the new leaders for the partitions are, and it started sending messages to those brokers. KAFKA-2120 will handle all of this for you automatically.

Thanks,

Mayuresh

On Tue, Sep 8, 2015 at 8:26 PM, Steven Wu <[email protected]> wrote:
> We have observed that some producer instances stopped sending traffic to
> brokers because the memory buffer was full. Those producers got stuck in
> this state permanently. Because we couldn't find out which broker was bad,
> I did a rolling restart of all brokers. After the bad broker got bounced,
> those stuck producers were out of the woods automatically.
>
> I don't know the exact problem with that bad broker. It seems to me that
> some ZK state was inconsistent.
>
> I know the timeout fix from KAFKA-2120 can probably avoid the permanent
> stuck state. Here are some additional questions:
> 1) any suggestion on how to identify the bad broker(s)?
> 2) why bouncing of the bad broker got the producers recovered automatically
> (without restarting producers)
>
> producer: 0.8.2.1
> broker: 0.8.2.1
>
> Thanks,
> Steven

--
-Regards,
Mayuresh R. Gharat
(862) 250-7125
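The recovery mechanism in answer 2 can be sketched in miniature: a failed send triggers a metadata refresh, the refresh returns the leaders elected during the bounce, and the producer resumes sending to them. This is an illustrative simulation, not Kafka client code; all class and method names are hypothetical:

```python
# Hypothetical sketch of producer recovery via metadata refresh.
class Cluster:
    """Toy stand-in for the broker cluster and its metadata."""
    def __init__(self):
        self.brokers = {1: True, 2: True}   # broker id -> alive?
        self.leader_for = {0: 1, 1: 2}      # partition -> leader broker id

    def alive(self, broker):
        return self.brokers[broker]

    def current_leaders(self):
        # Stands in for answering a TopicMetadataRequest.
        return dict(self.leader_for)

    def deliver(self, broker, partition):
        # A send succeeds only if the target is alive and actually the leader.
        return self.brokers[broker] and self.leader_for[partition] == broker

class Producer:
    def __init__(self, leaders):
        self.leaders = dict(leaders)        # cached metadata: partition -> leader

    def send(self, partition, cluster):
        broker = self.leaders[partition]
        if not cluster.alive(broker):
            # Send failed: refresh metadata and retry with the new leader.
            self.leaders = cluster.current_leaders()
            broker = self.leaders[partition]
        return cluster.deliver(broker, partition)

p = Producer({0: 1, 1: 2})
c = Cluster()
c.brokers[1] = False      # broker 1 "goes bad"
c.leader_for[0] = 2       # bounce: leadership for partition 0 moves to broker 2
print(p.send(0, c))       # -> True: the refresh found the new leader
```

The real 0.8.2.1 producer could stay stuck because, with its buffer full and no request timeout, the failure that would trigger the refresh never surfaced until the bounce forced a leader change; KAFKA-2120's request timeout closes that gap.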
