Re: Batch job stuck in Canceled state in Flink 1.5

Stephan Ewen Thu, 03 May 2018 05:08:17 -0700

Google Drive would be great.

Thanks!


On Thu, May 3, 2018 at 1:33 PM, Amit Jain <aj201...@gmail.com> wrote:

> Hi Stephan,
>
> The JM log file is 122 MB. Could you suggest another medium for
> sharing it? We can use Google Drive if that's fine with you.
>
> --
> Thanks,
> Amit
>
> On Thu, May 3, 2018 at 12:58 PM, Stephan Ewen <se...@apache.org> wrote:
> > Hi Amit!
> >
> > Thanks for sharing this. It looks like a regression with the network
> > stack changes.
> >
> > The log you shared from the TaskManager gives some hints, but that
> > exception alone should not be a problem. It can occur under a race
> > between the deployment of some tasks and the whole job entering a
> > recovery phase (maybe we should not print it so prominently, to avoid
> > confusing users). There must be something else happening on the
> > JobManager. Can you share the JM logs as well?
> >
> > Thanks a lot,
> > Stephan
> >
> >
> > On Wed, May 2, 2018 at 12:21 PM, Amit Jain <aj201...@gmail.com> wrote:
> >>
> >> Thanks, Fabian!
> >>
> >> I will try using the current release-1.5 branch and update this thread.
> >>
> >> --
> >> Thanks,
> >> Amit
> >>
> >> On Wed, May 2, 2018 at 3:42 PM, Fabian Hueske <fhue...@gmail.com> wrote:
> >> > Hi Amit,
> >> >
> >> > We recently fixed a bug in the network stack that affected batch jobs
> >> > (FLINK-9144).
> >> > The fix was added after your commit.
> >> >
> >> > Do you have a chance to build the current release-1.5 branch and check
> >> > if the fix also resolves your problem?
> >> >
> >> > Otherwise it would be great if you could open a blocker issue for the
> >> > 1.5 release to ensure that this is fixed.
> >> >
> >> > Thanks,
> >> > Fabian
> >> >
> >> > 2018-04-29 18:30 GMT+02:00 Amit Jain <aj201...@gmail.com>:
> >> >>
> >> >> Cluster is running on commit 2af481a
> >> >>
> >> >> On Sun, Apr 29, 2018 at 9:59 PM, Amit Jain <aj201...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > We are running a number of batch jobs on a Flink 1.5 cluster, and a
> >> >> > few of them are getting stuck at random. These jobs hit the following
> >> >> > failure, after which the operator status changes to CANCELED and
> >> >> > stays stuck there.
> >> >> >
> >> >> > Please find the complete TM log at
> >> >> > https://gist.github.com/imamitjain/066d0e99990ee24f2da1ddc83eba2012
> >> >> >
> >> >> >
> >> >> > 2018-04-29 14:57:24,437 INFO org.apache.flink.runtime.taskmanager.Task
> >> >> > - Producer 98f5976716234236dc69fb0e82a0cc34 of partition
> >> >> > 5b8dd5ac2df14266080a8e049defbe38 disposed. Cancelling execution.
> >> >> >
> >> >> > org.apache.flink.runtime.jobmanager.PartitionProducerDisposedException:
> >> >> > Execution 98f5976716234236dc69fb0e82a0cc34 producing partition
> >> >> > 5b8dd5ac2df14266080a8e049defbe38 has already been disposed.
> >> >> >     at org.apache.flink.runtime.jobmaster.JobMaster.requestPartitionState(JobMaster.java:610)
> >> >> >     at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source)
> >> >> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >> >     at java.lang.reflect.Method.invoke(Method.java:498)
> >> >> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:210)
> >> >> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
> >> >> >     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleMessage(FencedAkkaRpcActor.java:69)
> >> >> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
> >> >> >     at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
> >> >> >     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> >> >> >     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> >> >> >     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> >> >> >     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> >> >> >     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> >> >> >     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> >> >> >     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> >> >> >     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >> >> >     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> >> >> >     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >> >> >     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >> >> >
> >> >> >
> >> >> > Thanks
> >> >> > Amit
> >> >
> >> >
> >
> >
>
