For some reason, I am not able to get the “under-replicated partitions” metric on my Kafka cluster to zero across all nodes. Even after I manually reassign all the partitions, one server still has 928 under-replicated partitions. Also, the number of partitions each server is leading is very uneven: it ranges from 268 up to 2,098.
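In case it helps anyone reproduce the numbers, here is a rough sketch of how the leader count per broker and the list of under-replicated partitions could be tallied from the text output of `kafka-topics.sh --describe`. It assumes the usual per-partition line layout (`Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...`). The sample text, broker IDs, and the `summarize` helper are made up for illustration and are not taken from my cluster.

```python
import re
from collections import Counter

# Hypothetical sample of `kafka-topics.sh --describe` output.
# Broker IDs and replica layout are invented for illustration only.
SAMPLE = (
    "Topic:funnel-metrics\tPartitionCount:2\tReplicationFactor:3\tConfigs:\n"
    "\tTopic: funnel-metrics\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "\tTopic: funnel-metrics\tPartition: 1\tLeader: 1\tReplicas: 2,3,1\tIsr: 1,3\n"
)

# Matches per-partition lines only; the topic summary line has no "Partition:" field.
PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+)"
    r"\s+Replicas:\s*(\S*)\s+Isr:\s*(\S*)"
)


def summarize(describe_output):
    """Return (leaders_per_broker, under_replicated_partitions)."""
    leaders = Counter()
    under_replicated = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue
        topic, partition, leader, replicas, isr = match.groups()
        leaders[int(leader)] += 1
        replica_set = set(filter(None, replicas.split(",")))
        isr_set = set(filter(None, isr.split(",")))
        # A partition is under-replicated when its ISR is smaller than its replica list.
        if len(isr_set) < len(replica_set):
            under_replicated.append(f"{topic}-{partition}")
    return leaders, under_replicated


if __name__ == "__main__":
    leaders, under_replicated = summarize(SAMPLE)
    for broker, count in sorted(leaders.items()):
        print(f"broker {broker} leads {count} partition(s)")
    print("under-replicated:", under_replicated)
```

Feeding it the real describe output for the whole cluster should show both the skew in leadership and which partitions are missing in-sync replicas.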
In server.log, I see many messages like this for various partitions:

DateTime=[2018-09-27 18:57:21,133] Type=WARN Message="[ReplicaManager broker=167927108] While recording the replica LEO, the partition funnel-metrics-3 hasn't been created." Class=(kafka.server.ReplicaManager)

Also, kafka-reassign-partitions.sh with “--verify” reports “Reassignment of partition such-and-such-0 failed” for many partitions.

Meanwhile, on clients trying to write messages into Kafka, I see messages like these:

logger=org.apache.kafka.clients.NetworkClient, , message="Error while fetching metadata with correlation id 274 : {alpha-checkout-event=INVALID_REPLICATION_FACTOR}"

logger=org.apache.kafka.clients.producer.internals.Sender, , message="Got error produce response with correlation id 580 on topic-partition usersignals-14, retrying (10 attempts left). Error: NOT_LEADER_FOR_PARTITION"

logger=c.expedia.www.hendrix.generator.framework.kafka.KafkaConsumerRunnable, , message="Kafka producer asynchronous send Future failed. Topic: tnl-exposure-logs Partition: null"

Does anyone have any idea what the problem is, or what I can do about it? Thanks!