We have observed that some producer instances stopped sending traffic to brokers, because the memory buffer is full. those producers got stuck in this state permanently. Because we couldn't find out which broker is bad here. So I did a rolling restart the all brokers. after the bad broker got bounce, those stuck producers out of the woods automatically.
I don't know the exact problem with that bad broker. it seems to me that some ZK states are inconsistent. I know timeout fix from KAFKA-2120 can probably avoid the permanent stuck. Here are some additional questions. 1) any suggestion on how to identify the bad broker(s)? 2) why bouncing of the bad broker got the producers recovered automatically (without restarting producers) producer: 0.8.2.1 broker: 0.8.2.1 Thanks, Steven