Hi, has anyone successfully upgraded Kafka from 0.8.0 to 0.8.1, one broker at a time, on a live cluster?
I am seeing strange behavior where many of my Kafka topics become unusable by both consumers and producers. When that happens, I see lots of errors in the server logs that look like this:

[2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with correlation id 2455 from client ReplicaFetcherThread-15-1007 on partition [risk,0] failed due to Topic risk either doesn't exist or is in the process of being deleted (kafka.server.KafkaApis)
[2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with correlation id 2455 from client ReplicaFetcherThread-7-1007 on partition [message,0] failed due to Topic message either doesn't exist or is in the process of being deleted (kafka.server.KafkaApis)

When I try to consume a message from a topic that triggered the above warning, I get the exception below:

....topic message --from-beginning
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[2014-04-09 10:40:30,571] WARN [console-consumer-90716_dkafkadatahub07.tag-dev.com-1397065229615-7211ba72-leader-finder-thread], Failed to add leader for partitions [message,0]; will retry (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
kafka.common.UnknownTopicOrPartitionException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at java.lang.Class.newInstance0(Class.java:355)
        at java.lang.Class.newInstance(Class.java:308)
        at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:79)
        at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:167)
        at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60)
        at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:179)
        at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:174)
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
        at kafka.server.AbstractFetcherThread.addPartitions(AbstractFetcherThread.scala:174)
        at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:86)
        at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:76)
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
        at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:76)
        at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:95)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

----------
*More details about my issues:*

My current configuration in the environment where I am testing the upgrade is 4 physical servers, each running 2 brokers, with the controlled shutdown feature enabled. When I shut down the 2 brokers on one of the existing Kafka 0.8.0 machines, upgrade that machine to 0.8.1, and restart it, all is fine for a bit. Once the new brokers come up, I run kafka-preferred-replica-election.sh to make sure the restarted brokers become leaders of the existing topics. The replication factor on the topics is set to 4.
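For reference, this is roughly how controlled shutdown and the election are set up on my side. The ZooKeeper address below is a placeholder for my actual ensemble, and the controlled-shutdown option names are the ones documented for the 0.8.1 broker (on the 0.8.0 brokers the feature is invoked differently, through the kafka.admin.ShutdownBroker tool):

    # server.properties on the upgraded 0.8.1 brokers
    controlled.shutdown.enable=true
    controlled.shutdown.max.retries=3
    controlled.shutdown.retry.backoff.ms=5000

    # run after the upgraded brokers rejoin; without --path-to-json-file
    # the tool runs the election for all topic partitions
    bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181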
I tested both producing and consuming messages against brokers that were leaders on both Kafka 0.8.0 and 0.8.1, and no issues were encountered. Later, I performed a controlled shutdown of the 2 remaining brokers on a server still running Kafka 0.8.0, and after those brokers shut down and new leaders were assigned, all of my server logs started filling up with the above exceptions and most of my topics became unusable.

I pulled and built the 0.8.1 Kafka code from git last Thursday, so I should be pretty much up to date. I am not sure whether I am doing something wrong or whether migrating from 0.8.0 to 0.8.1 on a live cluster one server at a time is simply not supported. Is there a recommended approach for migrating a live 0.8.0 cluster to 0.8.1?

The leader of one of the topics that became unusable is the broker that was successfully upgraded to 0.8.1:

Topic:message   PartitionCount:1        ReplicationFactor:4     Configs:
        Topic: message  Partition: 0    Leader: 1007    Replicas: 1007,8,9,1001 Isr: 1001,1007,8

Brokers 9 and 1009 were shut down on one physical server that had Kafka 0.8.0 installed when these problems started occurring (I was planning to upgrade them to 0.8.1). The only way I can recover from this state is to shut down all brokers, delete all of the Kafka topic logs plus the ZooKeeper Kafka directory, and start over with a fresh cluster.

Your help in this matter is greatly appreciated.

Thanks,
Martin
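P.S. For completeness, here are roughly the commands behind the output above (zk1:2181 again stands in for my actual ZooKeeper ensemble; the topic description is from the 0.8.1 topic tool, and the consumer is the stock console consumer whose command line I truncated earlier):

    bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic message
    bin/kafka-console-consumer.sh --zookeeper zk1:2181 --topic message --from-beginning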