We have observed that some producer instances stopped sending traffic to
brokers because their memory buffer was full. Those producers got stuck in
this state permanently. Because we couldn't find out which broker was bad,
I did a rolling restart of all brokers. After the bad broker got bounced,
the stuck producers recovered automatically.

I don't know the exact problem with that bad broker. It seems to me that
some ZK state was inconsistent.

I know the timeout fix from KAFKA-2120 can probably prevent producers from
getting stuck permanently. Here are some additional questions.
1) Any suggestions on how to identify the bad broker(s)?
2) Why did bouncing the bad broker let the producers recover automatically
(without restarting the producers)?

producer: 0.8.2.1
broker: 0.8.2.1

Thanks,
Steven
