The first log was from a time when there was quite a bit of load on the Kafka servers. But in the second link I shared you, will notice that the task did not even start - this was when the traffic was very light on the Kafka cluster.
I enable debug mode to see if that uncovers more. On Dec 16, 2015 12:27 AM, "Yi Pan" <nickpa...@gmail.com> wrote: > Hi, Kishore, > > I saw a lot of NOT_LEADER_FOR_PARTITION errors in the first log. It seems > like your job is running for a while and started to lose the leader broker > for the topic in Kafka and the problem was not resolved by re-try. Are you > pumping through a lot of traffic to the broker? Can you try to take a topic > with very light traffic to see at least whether the job, or the Kafka > cluster is stable in a long run? Also, turning on debug in the log4j would > help to identify more issues. > > Thanks a lot! > > On Tue, Dec 15, 2015 at 5:02 AM, Kishore N C <kishor...@gmail.com> wrote: > > > I did manage to locate a task which was blocked without starting as I had > > described earlier. The log for that task is here: > > > > > > > https://gist.githubusercontent.com/kishorenc/ef806e85478378ce2203/raw/8537d3d3644eea7cdd9efc9fa8749c0840092f3c/gistfile1.txt > > > > Thanks, > > > > Kishore. > > > > On Tue, Dec 15, 2015 at 4:30 PM, Kishore N C <kishor...@gmail.com> > wrote: > > > > > Hi Yi Pan, > > > > > > I'm using Samza 0.9.1 and Kafka 0.8.2.1. Here's an example of a full > task > > > log: > > > > > > > > > > > > https://gist.githubusercontent.com/kishorenc/5d65f114a50b9ef6a6b3/raw/5b9ecffdd1af831f713e8b41e5b77e5b881e8173/gistfile1.txt > > > > > > You will find "java.nio.channels.ClosedChannelException" towards the > end. > > > > > > One additional thing to mention here is that I detect when the Samza > job > > > has completed (by checking offsets periodically in the window() calls) > > and > > > issue a "taskCoordinator.shutdown(RequestScope.CURRENT_TASK)" in my > code. > > > > > > In some other cases, an error like: > > > > > > "2015-12-15 10:09:54 ClientUtils$ [WARN] Fetching topic metadata with > > > correlation id 10 for topics [Set(TOPIC_TWO)] from broker > > > [id:2,host:10.181.18.171,port:9082] failed > > > java.nio.channels.ClosedChannelException" > > > > > > occurs right at the start of the job and the task would just refuse to > > > start. Unfortunately, I don't have the logs for such a container > anymore. > > > > > > Thanks, > > > > > > Kishore. > > > > > > On Tue, Dec 15, 2015 at 12:31 AM, Yi Pan <nickpa...@gmail.com> wrote: > > > > > >> Hi, Kishore, > > >> > > >> First, I would like to ask which version of Samza you are running? And > > if > > >> you can attach the log and config of your container (i.e. I assume the > > log > > >> you attached here is a container log?), it would be greatly helpful. > > >> > > >> Thanks a lot! > > >> > > >> -Yi > > >> > > >> On Mon, Dec 14, 2015 at 5:07 AM, Kishore N C <kishor...@gmail.com> > > wrote: > > >> > > >> > Hi, > > >> > > > >> > I have a 25 node Samza cluster and I am running a job on a dataset > of > > a > > >> > billion records that is backed by a 7 node Kafka cluster. > > >> > > > >> > Some of the tasks on some of the Samza nodes don't seem to start at > > all > > >> > (while other tasks run fine on other nodes). The specific error > > message > > >> I > > >> > see is in the task log is: > > >> > > > >> > 2015-12-14 12:50:50 ClientUtils$ [INFO] Fetching metadata from > broker > > >> > id:5,host:10.181.18.87,port:9082 with correlation id 0 for 2 > topic(s) > > >> > Set(TOPIC_ONE, TOPIC_TWO) > > >> > 2015-12-14 12:50:50 SyncProducer [INFO] Connected to > > 10.181.18.87:9082 > > >> for > > >> > producing > > >> > 2015-12-14 12:50:50 SyncProducer [INFO] Disconnecting from > > >> > 10.181.18.87:9082 > > >> > 2015-12-14 12:51:22 SimpleConsumer [INFO] Reconnect due to socket > > error: > > >> > java.nio.channels.ClosedChannelException > > >> > > > >> > Sometimes, there is a variation like this: > > >> > > > >> > 2015-12-14 13:05:47 ClientUtils$ [WARN] Fetching topic metadata with > > >> > correlation id 0 for topics [Set(TOPIC_ONE, TOPIC_TWO)] from > > >> > broker [id:6,host:10.181.18.193,port:9082] failed > > >> > java.nio.channels.ClosedChannelException > > >> > at kafka.network.BlockingChannel.send(BlockingChannel.scala:100) > > >> > at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73) > > >> > at > > >> > > > >> > > > >> > > > kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72) > > >> > at kafka.producer.SyncProducer.send(SyncProducer.scala:113) > > >> > at > kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58) > > >> > at > kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93) > > >> > at > > >> > > > >> > > > >> > > > org.apache.samza.util.ClientUtilTopicMetadataStore.getTopicInfo(ClientUtilTopicMetadataStore.scala:37) > > >> > at > > >> > > > >> > > > >> > > > org.apache.samza.system.kafka.KafkaSystemAdmin.getTopicMetadata(KafkaSystemAdmin.scala:214) > > >> > at > > >> > > > >> > > > >> > > > org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2$$anonfun$6.apply(KafkaSystemAdmin.scala:158) > > >> > at > > >> > > > >> > > > >> > > > org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2$$anonfun$6.apply(KafkaSystemAdmin.scala:158) > > >> > at > > >> > > > >> > > > >> > > > org.apache.samza.system.kafka.TopicMetadataCache$.getTopicMetadata(TopicMetadataCache.scala:52) > > >> > at > > >> > > > >> > > > >> > > > org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2.apply(KafkaSystemAdmin.scala:155) > > >> > at > > >> > > > >> > > > >> > > > org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2.apply(KafkaSystemAdmin.scala:154) > > >> > at > > >> > > > >> > > > >> > > > org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:82) > > >> > > > >> > The above logs just keeps looping and the task never starts > processing > > >> > input. I was able to telnet into the host/port combination from the > > same > > >> > machine. Any idea/pointers to what could be going wrong is greatly > > >> > appreciated. > > >> > > > >> > Thanks, > > >> > > > >> > KN. > > >> > > > >> > > > > > > > > > > > > -- > > > It is our choices that show what we truly are, > > > far more than our abilities. > > > > > > > > > > > -- > > It is our choices that show what we truly are, > > far more than our abilities. > > >