Hi Kafka users! I was migrating a cluster of 3 brokers from one set of EC2 instances to another, but ran into replication problems. The migration method is to stop one broker and let a replacement broker join the cluster with the same broker.id, so that it re-replicates the stopped broker's partitions.
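Concretely, the replacement instance comes up with the old broker's id in its server.properties, roughly along these lines (a minimal sketch; everything except broker.id is illustrative, not our real config):

    # server.properties on the replacement EC2 instance
    broker.id=544181083                             # same id as the stopped broker; it appears as ReplicaId in the log below
    log.dirs=/var/kafka/data                        # illustrative path
    zookeeper.connect=zk1:2181,zk2:2181,zk3:2181    # illustrative ensemble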
Replication started, but after roughly 4 of ~15 GB it stalled, with the following errors logged about every 500 ms.

On the new broker (the fetcher):

[2014-11-04 17:02:33,762] ERROR [ReplicaFetcherThread-0-1926078608], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 1523; ClientId: ReplicaFetcherThread-0-1926078608; ReplicaId: 544181083; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [qa.mx-error,302] -> PartitionFetchInfo(0,10485760), [qa.xl-msg,46] -> PartitionFetchInfo(101768,10485760), [qa.xl-error,202] -> PartitionFetchInfo(0,10485760), [qa.mx-msg,177] -> ... total of 700+ partitions ... -> PartitionFetchInfo(0,10485760) (kafka.server.ReplicaFetcherThread)
java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
    at kafka.utils.Utils$.read(Utils.scala:376)
    at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
    at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
    at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
    at kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
    at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:81)
    at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
[2014-11-04 17:02:33,765] WARN Reconnect due to socket error: null (kafka.consumer.SimpleConsumer)

On one of the two old brokers (presumably the one serving the data):

[2014-11-04 17:03:28,030] ERROR Closing socket for /10.145.135.246 because of error (kafka.network.Processor)
kafka.common.KafkaException: This operation cannot be completed on a complete request.
    at kafka.network.Transmission$class.expectIncomplete(Transmission.scala:34)
    at kafka.api.FetchResponseSend.expectIncomplete(FetchResponse.scala:191)
    at kafka.api.FetchResponseSend.writeTo(FetchResponse.scala:214)
    at kafka.network.Processor.write(SocketServer.scala:375)
    at kafka.network.Processor.run(SocketServer.scala:247)
    at java.lang.Thread.run(Thread.java:745)

This looks similar to an earlier thread, but that discussion doesn't seem to reach a resolution:
http://thread.gmane.org/gmane.comp.apache.kafka.user/1153

There is also this one, again without a resolution:
http://thread.gmane.org/gmane.comp.apache.kafka.user/3804

Does anyone have any clues as to what might be going on here, and any suggestions for a fix?

Thanks,
Christofer