Hello,

Last week I upgraded one relatively large Kafka 0.10.0.1 cluster (EC2, 10 brokers, ~30 TB of data, 100-300 Mbps in/out per instance) to 1.0 and ran into some issues.
Out of ~100 topics with 2 to 20 partitions each, 9 partitions in 8 topics became "unavailable" across 3 brokers: the leader was shown as -1 and the ISR was empty. A Java service using 0.10.0.1 clients was unable to send any data to these partitions, so that data got dropped. The affected partitions showed up in the output of `kafka/bin/kafka-topics.sh --zookeeper <zk's> --unavailable-partitions --describe`. There was nothing special about these partitions; among them were both big ones (hundreds of gigabytes) and tiny ones (megabytes).

The fix was to enable unclean leader election for each affected topic and restart one of the brokers hosting replicas of each affected partition: `kafka/bin/kafka-configs.sh --zookeeper <zk's> --entity-type topics --entity-name <topicname> --add-config unclean.leader.election.enable=true --alter`.

Has anyone seen something like this, and is there a way to avoid it during the next upgrade? Maybe it would be better if the cluster received no traffic during the upgrade, but we cannot take a maintenance break since everything runs 24/7. The cluster is for analytics data, some of which is consumed by real-time applications, mostly by secor.

BR,
Mika

--
*Mika Linnanoja*
Senior Cloud Engineer
Games Technology
Rovio Entertainment Corp
Keilaranta 7, FIN - 02150 Espoo, Finland
mika.linnan...@rovio.com
www.rovio.com
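PS. For anyone copying the workaround: once the ISRs are back you probably want to remove the per-topic override again, and checking for under-replicated partitions is a handy way to confirm replication has fully caught up. Roughly something like the following (the --under-replicated-partitions and --delete-config flags are from memory, so verify them against your tooling version):

`kafka/bin/kafka-topics.sh --zookeeper <zk's> --under-replicated-partitions --describe`
`kafka/bin/kafka-configs.sh --zookeeper <zk's> --entity-type topics --entity-name <topicname> --delete-config unclean.leader.election.enable --alter`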