Hi,

I have a 25-node Samza cluster and I am running a job on a dataset of a
billion records that is backed by a 7-node Kafka cluster.

Some of the tasks on some of the Samza nodes don't seem to start at all
(while other tasks run fine on other nodes). The specific error message I
see in the task log is:

2015-12-14 12:50:50 ClientUtils$ [INFO] Fetching metadata from broker
id:5,host:10.181.18.87,port:9082 with correlation id 0 for 2 topic(s)
Set(TOPIC_ONE, TOPIC_TWO)
2015-12-14 12:50:50 SyncProducer [INFO] Connected to 10.181.18.87:9082 for
producing
2015-12-14 12:50:50 SyncProducer [INFO] Disconnecting from 10.181.18.87:9082
2015-12-14 12:51:22 SimpleConsumer [INFO] Reconnect due to socket error:
java.nio.channels.ClosedChannelException

Sometimes, there is a variation like this:

2015-12-14 13:05:47 ClientUtils$ [WARN] Fetching topic metadata with
correlation id 0 for topics [Set(TOPIC_ONE, TOPIC_TWO)] from
broker [id:6,host:10.181.18.193,port:9082] failed
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at org.apache.samza.util.ClientUtilTopicMetadataStore.getTopicInfo(ClientUtilTopicMetadataStore.scala:37)
at org.apache.samza.system.kafka.KafkaSystemAdmin.getTopicMetadata(KafkaSystemAdmin.scala:214)
at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2$$anonfun$6.apply(KafkaSystemAdmin.scala:158)
at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2$$anonfun$6.apply(KafkaSystemAdmin.scala:158)
at org.apache.samza.system.kafka.TopicMetadataCache$.getTopicMetadata(TopicMetadataCache.scala:52)
at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2.apply(KafkaSystemAdmin.scala:155)
at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2.apply(KafkaSystemAdmin.scala:154)
at org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:82)

The above log output just keeps looping and the affected tasks never start
processing input. I was able to telnet to the same host/port combination
from the same machine. Any ideas or pointers as to what could be going
wrong would be greatly appreciated.
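
For reference, the telnet check can be scripted roughly as below. This is
only a minimal sketch: can_connect and check_brokers are hypothetical helper
names, and the two broker addresses are the ones from the log lines above.
Like telnet, it only confirms that the TCP port accepts a connection; it
does not exercise the Kafka metadata request that is actually failing.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def check_brokers(brokers, timeout=5.0):
    """Map "host:port" to True/False for each (host, port) pair."""
    return {
        "%s:%d" % (host, port): can_connect(host, port, timeout)
        for host, port in brokers
    }


# Broker addresses as they appear in the logs; the other brokers
# would need to be added by hand.
BROKERS = [("10.181.18.87", 9082), ("10.181.18.193", 9082)]
```

Calling check_brokers(BROKERS) on a node where tasks fail and on a node
where they run would show whether plain TCP reachability differs between
the two.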

Thanks,

KN.
