Hi, Kishore,

I saw a lot of NOT_LEADER_FOR_PARTITION errors in the first log. It seems
like your job ran for a while and then started to lose the leader broker
for the topic in Kafka, and the problem was not resolved by retries. Are
you pumping a lot of traffic through the broker? Can you try a topic with
very light traffic to see at least whether the job, or the Kafka cluster,
is stable in the long run? Also, turning on DEBUG logging in log4j would
help identify more issues.
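
For example, if you are using a log4j.properties for the container, a
change like this should do it (just a sketch; the logger names are the
standard Samza and Kafka client packages, adjust to your own config):

    # raise the Samza and Kafka client classes to DEBUG
    log4j.logger.org.apache.samza=DEBUG
    log4j.logger.kafka=DEBUG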

Thanks a lot!

On Tue, Dec 15, 2015 at 5:02 AM, Kishore N C <kishor...@gmail.com> wrote:

> I did manage to locate a task that was blocked and never started, as I
> described earlier. The log for that task is here:
>
>
> https://gist.githubusercontent.com/kishorenc/ef806e85478378ce2203/raw/8537d3d3644eea7cdd9efc9fa8749c0840092f3c/gistfile1.txt
>
> Thanks,
>
> Kishore.
>
> On Tue, Dec 15, 2015 at 4:30 PM, Kishore N C <kishor...@gmail.com> wrote:
>
> > Hi Yi Pan,
> >
> > I'm using Samza 0.9.1 and Kafka 0.8.2.1. Here's an example of a full task
> > log:
> >
> >
> > https://gist.githubusercontent.com/kishorenc/5d65f114a50b9ef6a6b3/raw/5b9ecffdd1af831f713e8b41e5b77e5b881e8173/gistfile1.txt
> >
> > You will find "java.nio.channels.ClosedChannelException" towards the end.
> >
> > One additional thing to mention here: I detect when the Samza job has
> > completed (by checking offsets periodically in the window() calls) and
> > then issue a "taskCoordinator.shutdown(RequestScope.CURRENT_TASK)" in
> > my code, roughly as in the sketch below.
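> >
> > A minimal sketch of that task (Samza 0.9.x API; isJobComplete() is my
> > own offset-checking helper, elided here):
> >
> > import org.apache.samza.system.IncomingMessageEnvelope;
> > import org.apache.samza.task.MessageCollector;
> > import org.apache.samza.task.StreamTask;
> > import org.apache.samza.task.TaskCoordinator;
> > import org.apache.samza.task.WindowableTask;
> >
> > public class OffsetCheckingTask implements StreamTask, WindowableTask {
> >
> >   @Override
> >   public void process(IncomingMessageEnvelope envelope,
> >       MessageCollector collector, TaskCoordinator coordinator) {
> >     // normal message handling ...
> >   }
> >
> >   @Override
> >   public void window(MessageCollector collector, TaskCoordinator coordinator) {
> >     // once this task has caught up to the target offsets, shut down
> >     // only this task; other tasks in the container keep running
> >     if (isJobComplete()) {
> >       coordinator.shutdown(TaskCoordinator.RequestScope.CURRENT_TASK);
> >     }
> >   }
> >
> >   private boolean isJobComplete() {
> >     // compare current offsets against the offsets captured at startup
> >     // (details elided)
> >     return false;
> >   }
> > }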
> >
> > In some other cases, an error like:
> >
> > "2015-12-15 10:09:54 ClientUtils$ [WARN] Fetching topic metadata with
> > correlation id 10 for topics [Set(TOPIC_TWO)] from broker
> > [id:2,host:10.181.18.171,port:9082] failed
> > java.nio.channels.ClosedChannelException"
> >
> > occurs right at the start of the job, and the task just refuses to
> > start. Unfortunately, I don't have the logs for such a container
> > anymore.
> >
> > Thanks,
> >
> > Kishore.
> >
> > On Tue, Dec 15, 2015 at 12:31 AM, Yi Pan <nickpa...@gmail.com> wrote:
> >
> >> Hi, Kishore,
> >>
> >> First, which version of Samza are you running? Also, if you could
> >> attach the log and config of your container (I assume the log you
> >> attached here is a container log?), that would be very helpful.
> >>
> >> Thanks a lot!
> >>
> >> -Yi
> >>
> >> On Mon, Dec 14, 2015 at 5:07 AM, Kishore N C <kishor...@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I have a 25-node Samza cluster, and I am running a job on a dataset
> >> > of a billion records that is backed by a 7-node Kafka cluster.
> >> >
> >> > Some of the tasks on some of the Samza nodes don't seem to start at
> >> > all (while other tasks run fine on other nodes). The specific error
> >> > message I see in the task log is:
> >> >
> >> > 2015-12-14 12:50:50 ClientUtils$ [INFO] Fetching metadata from broker id:5,host:10.181.18.87,port:9082 with correlation id 0 for 2 topic(s) Set(TOPIC_ONE, TOPIC_TWO)
> >> > 2015-12-14 12:50:50 SyncProducer [INFO] Connected to 10.181.18.87:9082 for producing
> >> > 2015-12-14 12:50:50 SyncProducer [INFO] Disconnecting from 10.181.18.87:9082
> >> > 2015-12-14 12:51:22 SimpleConsumer [INFO] Reconnect due to socket error: java.nio.channels.ClosedChannelException
> >> >
> >> > Sometimes, there is a variation like this:
> >> >
> >> > 2015-12-14 13:05:47 ClientUtils$ [WARN] Fetching topic metadata with correlation id 0 for topics [Set(TOPIC_ONE, TOPIC_TWO)] from broker [id:6,host:10.181.18.193,port:9082] failed
> >> > java.nio.channels.ClosedChannelException
> >> > at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
> >> > at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
> >> > at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
> >> > at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
> >> > at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
> >> > at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
> >> > at org.apache.samza.util.ClientUtilTopicMetadataStore.getTopicInfo(ClientUtilTopicMetadataStore.scala:37)
> >> > at org.apache.samza.system.kafka.KafkaSystemAdmin.getTopicMetadata(KafkaSystemAdmin.scala:214)
> >> > at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2$$anonfun$6.apply(KafkaSystemAdmin.scala:158)
> >> > at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2$$anonfun$6.apply(KafkaSystemAdmin.scala:158)
> >> > at org.apache.samza.system.kafka.TopicMetadataCache$.getTopicMetadata(TopicMetadataCache.scala:52)
> >> > at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2.apply(KafkaSystemAdmin.scala:155)
> >> > at org.apache.samza.system.kafka.KafkaSystemAdmin$$anonfun$getSystemStreamMetadata$2.apply(KafkaSystemAdmin.scala:154)
> >> > at org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:82)
> >> >
> >> > The above log just keeps looping, and the task never starts
> >> > processing input. I was able to telnet to the host/port combination
> >> > from the same machine. Any ideas/pointers to what could be going
> >> > wrong are greatly appreciated.
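> >> >
> >> > As a further connectivity check beyond telnet, one could fetch topic
> >> > metadata directly with the Kafka 0.8.2 SimpleConsumer API. Just a
> >> > sketch, hard-coding the broker host/port from the log above:
> >> >
> >> > import java.util.Arrays;
> >> > import kafka.javaapi.TopicMetadata;
> >> > import kafka.javaapi.TopicMetadataRequest;
> >> > import kafka.javaapi.TopicMetadataResponse;
> >> > import kafka.javaapi.consumer.SimpleConsumer;
> >> >
> >> > public class MetadataCheck {
> >> >   public static void main(String[] args) {
> >> >     // same broker and port that the container is failing against
> >> >     SimpleConsumer consumer = new SimpleConsumer(
> >> >         "10.181.18.87", 9082, 100000, 64 * 1024, "metadata-check");
> >> >     try {
> >> >       TopicMetadataRequest request =
> >> >           new TopicMetadataRequest(Arrays.asList("TOPIC_ONE", "TOPIC_TWO"));
> >> >       TopicMetadataResponse response = consumer.send(request);
> >> >       for (TopicMetadata metadata : response.topicsMetadata()) {
> >> >         // error code 0 means the metadata fetch for the topic succeeded
> >> >         System.out.println(metadata.topic() + " -> " + metadata.errorCode());
> >> >       }
> >> >     } finally {
> >> >       consumer.close();
> >> >     }
> >> >   }
> >> > }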
> >> >
> >> > Thanks,
> >> >
> >> > KN.
> >> >
> >>
> >
> >
> >
> > --
> > It is our choices that show what we truly are,
> > far more than our abilities.
> >
>
>
>
> --
> It is our choices that show what we truly are,
> far more than our abilities.
>
