I unfortunately do not have any specific logs from these events, but I will try to describe them as accurately as possible to give an idea of the problem I saw.
The odd behavior manifested itself when I bounced all of the Kafka processes on each of the servers in a 12-node cluster. A few weeks prior I had done a partition reassignment to add four new Kafka brokers to the cluster. The cluster has 4 topics, each with 350 partitions, a retention policy of 6 hours, and a replication factor of 1.

Originally I attempted to run a migration on all of the topics and partitions at once, adding the 4 new nodes using the partition reassignment tool. This seemed to cause a lot of network congestion, and according to the logs some of the nodes were having trouble talking to each other. The congestion lasted for the duration of the migration and began to ease toward the end. After the migration I confirmed that data was being stored on and served from the new brokers.

Today I bounced each of the Kafka processes on each of the brokers to pick up a change made to the log4j properties. After bouncing one process I started seeing some strange errors on the four newer broker nodes that looked like:

  kafka.common.NotAssignedReplicaException: Leader 10 failed to record follower 7's position 0 for partition [topic-1,185] since the replica 7 is not recognized to be one of the assigned replicas 10 for partition [topic-2,185]

and on the older Kafka brokers the errors looked like:

  [2014-12-01 17:06:04,268] ERROR [ReplicaFetcherThread-0-12], Error for partition [topic-1,175] to broker 12:class kafka.common.UnknownException (kafka.server.ReplicaFetcherThread)

I proceeded to bounce the rest of the Kafka processes, and after that the errors seemed to stop. It wasn't until a few hours later that I noticed the amount of data stored on the 4 new brokers had dropped off significantly. When I ran a describe on the topics from the errors, it was clear that the partition assignments had been reverted to their state prior to the original migration that added the 4 new brokers.

I am unsure why bouncing the Kafka process would cause the state in ZooKeeper to get overwritten, given that it had seemed to be working for the last few weeks until the processes were restarted. My hunch is that the controller keeps some state about the world pre-reassignment and only removes that state after it detects that the reassignment completed successfully. In this case the network congestion on the brokers may have caused the controller not to be notified when all the reassignments finished, so it kept the pre-reassignment state around. When the process was bounced it read this state back from ZooKeeper and reverted the current assignment to the pre-reassignment state.

Has this behavior been observed before? Does this sound like a logical understanding of what happened in this case?

--
Andrew Jorgensen
@ajorgensen
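P.S. For reference, these are the checks I plan to run next time before bouncing anything, to see whether the controller still considers a reassignment in flight. This is just a rough sketch assuming the stock 0.8.x tooling; the ZooKeeper address (zk-host:2181), the reassignment JSON file (reassign.json), and the topic name are placeholders for whatever was actually used:

  # As far as I understand, this znode should only exist while a
  # reassignment is still pending and is removed once it completes.
  bin/zookeeper-shell.sh zk-host:2181 get /admin/reassign_partitions

  # Verify the original reassignment against the JSON file that was
  # passed to --execute during the migration.
  bin/kafka-reassign-partitions.sh --zookeeper zk-host:2181 \
    --reassignment-json-file reassign.json --verify

  # Confirm the current replica assignment for one of the affected topics.
  bin/kafka-topics.sh --zookeeper zk-host:2181 --describe --topic topic-1

If the znode had still been populated weeks after the migration, that would line up with my hunch that the controller never saw the reassignment complete.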