Hello,

we have a Kafka cluster of 20 brokers (v0.8.2.1), and we are repeatedly running into trouble in a maintenance scenario. In our setup, each broker node uses 2 HDs to store the logs, and the replication factor is 3 for all topics.
The typical maintenance scenario is that one of the disks on a node fails, so we stop the broker to get the disk replaced. After the HD replacement, half of the former data is thus missing on the broker. When the node comes online again, it streams the missing partition data (i.e. mainly that of the replaced, fresh disk) for some hours.

Our issue is that during this recovery time, we consistently run into instabilities on the side of our consumers (high-level consumer, Kafka-committed offsets). The consumer groups quite often have to rebalance their partition assignment during this time, which eventually leads to hanging consumption. If the consumer lag gets too big and we stop the recovering broker again for some time, or once the recovery of that broker has finally finished, everything stabilizes again.

Is there some known problem in this respect, or better yet a recommendation on how to deal with it? It sounds somewhat like the problem mentioned in https://issues.apache.org/jira/browse/KAFKA-1464.

Our impression is that once the recovering node becomes leader for some of its partitions while recovery is still in progress, it isn't yet able to serve those partitions properly, e.g. due to network saturation. Hence, the broker seems to periodically gain and lose leadership for those partitions, which might explain the instabilities / rebalancing of the consumer groups. The output in our state-change logfiles seems to confirm this, i.e. we do see quite a bit of leadership swapping there, specifically for the partitions for which the recovering broker should normally be leader.

Any advice in this matter would be much appreciated. For example, if there were a way to prevent the recovering node from acquiring leadership for any partitions, I suppose this could solve our problems if we activated something like that (manually) during recovery time.

Thanks in advance,
Ralph Weires