Hi, I have a 25-node Samza cluster and I am running a job on a dataset of a billion records that is backed by a 7-node Kafka cluster.
Some of the tasks on some of the Samza nodes don't seem to start at all (while other tasks run fine on other nodes). The specific error message I see in the task log is:

2015-12-14 12:50:50 ClientUtils$ [INFO] Fetching metadata from broker id:5,host:10.181.18.87,port:9082 with correlation id 0 for 2 topic(s) Set(TOPIC_ONE, TOPIC_TWO)
2015-12-14 12:50:50 SyncProducer [INFO] Connected to 10.181.18.87:9082 for producing
2015-12-14 12:50:50 SyncProducer [INFO] Disconnecting from 10.181.18.87:9082
2015-12-14 12:51:22 SimpleConsumer [INFO] Reconnect due to socket error: java.nio.channels.ClosedChannelException

Sometimes, there is a variation like this:

2015-12-14 13:05:47 ClientUtils$ [WARN] Fetching topic metadata with correlation id 0 for topics [Set(TOPIC_ONE, TOPIC_TWO)] from broker [id:6,host:10.181.18.193,port:9082] failed
java.nio.channels.ClosedChannelException
        at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
        at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
        at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
        at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
        at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
        at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
        at org.apache.samza.util.ClientUtilTopicMetadataStore.getTopicInfo(ClientUtilTopicMetadataStore.scala:37)
        at org.apache.samza.system.kafka.KafkaSystemAdmin.getTopicMetadata(KafkaSystemAdmin.scala:214)
        at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2$$anonfun$6.apply(KafkaSystemAdmin.scala:158)
        at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2$$anonfun$6.apply(KafkaSystemAdmin.scala:158)
        at org.apache.samza.system.kafka.TopicMetadataCache$.getTopicMetadata(TopicMetadataCache.scala:52)
        at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2.apply(KafkaSystemAdmin.scala:155)
        at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2.apply(KafkaSystemAdmin.scala:154)
        at org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:82)

The above logs just keep looping and the task never starts processing input. I was able to telnet into the host/port combination from the same machine.

Any ideas or pointers as to what could be going wrong would be greatly appreciated.

Thanks,
KN.
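
P.S. The telnet check mentioned above could be scripted so it can be repeated from every Samza node. This is only a minimal sketch: the two broker addresses are copied from the log lines above, while the function names and the 3-second timeout are illustrative assumptions. Like telnet, it only confirms that a TCP connection opens; it does not speak the Kafka protocol, so it would not show why the broker closes the channel afterwards.

```python
import socket

# Broker addresses copied from the log lines above; extend with the
# remaining brokers of the cluster as needed.
BROKERS = [("10.181.18.87", 9082), ("10.181.18.193", 9082)]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_brokers(brokers, timeout=3.0):
    """Map each (host, port) pair to whether a TCP connect succeeded."""
    return {(host, port): can_connect(host, port, timeout)
            for host, port in brokers}
```

Running `check_brokers(BROKERS)` on a node where tasks start and on one where they do not would show whether plain reachability differs between the two.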