Google Drive would be great. Thanks!
On Thu, May 3, 2018 at 1:33 PM, Amit Jain <aj201...@gmail.com> wrote: > Hi Stephan, > > Size of JM log file is 122 MB. Could you provide me other media to > post the same? We can use Google Drive if that's fine with you. > > -- > Thanks, > Amit > > On Thu, May 3, 2018 at 12:58 PM, Stephan Ewen <se...@apache.org> wrote: > > Hi Amit! > > > > Thanks for sharing this, this looks like a regression with the network > stack > > changes. > > > > The log you shared from the TaskManager gives some hint, but that > exception > > alone should not be a problem. That exception can occur under a race > between > > deployment of some tasks while the whole job is entering a recovery phase > > (maybe we should not print it so prominently to not confuse users). There > > must be something else happening on the JobManager. Can you share the JM > > logs as well? > > > > Thanks a lot, > > Stephan > > > > > > On Wed, May 2, 2018 at 12:21 PM, Amit Jain <aj201...@gmail.com> wrote: > >> > >> Thanks! Fabian > >> > >> I will try using the current release-1.5 branch and update this thread. > >> > >> -- > >> Thanks, > >> Amit > >> > >> On Wed, May 2, 2018 at 3:42 PM, Fabian Hueske <fhue...@gmail.com> > wrote: > >> > Hi Amit, > >> > > >> > We recently fixed a bug in the network stack that affected batch jobs > >> > (FLINK-9144). > >> > The fix was added after your commit. > >> > > >> > Do you have a chance to build the current release-1.5 branch and check > >> > if > >> > the fix also resolves your problem? > >> > > >> > Otherwise it would be great if you could open a blocker issue for the > >> > 1.5 > >> > release to ensure that this is fixed. > >> > > >> > Thanks, > >> > Fabian > >> > > >> > 2018-04-29 18:30 GMT+02:00 Amit Jain <aj201...@gmail.com>: > >> >> > >> >> Cluster is running on commit 2af481a > >> >> > >> >> On Sun, Apr 29, 2018 at 9:59 PM, Amit Jain <aj201...@gmail.com> > wrote: > >> >> > Hi, > >> >> > > >> >> > We are running numbers of batch jobs in Flink 1.5 cluster and few > of > >> >> > those > >> >> > are getting stuck at random. These jobs having the following > failure > >> >> > after > >> >> > which operator status changes to CANCELED and stuck to same. > >> >> > > >> >> > Please find complete TM's log at > >> >> > https://gist.github.com/imamitjain/066d0e99990ee24f2da1ddc83eba20 > 12 > >> >> > > >> >> > > >> >> > 2018-04-29 14:57:24,437 INFO > >> >> > org.apache.flink.runtime.taskmanager.Task > >> >> > - Producer 98f5976716234236dc69fb0e82a0cc34 of partition > >> >> > 5b8dd5ac2df14266080a8e049defbe38 disposed. Cancelling execution. > >> >> > > >> >> > org.apache.flink.runtime.jobmanager.PartitionProducerDisposedExcep > tion: > >> >> > Execution 98f5976716234236dc69fb0e82a0cc34 producing partition > >> >> > 5b8dd5ac2df14266080a8e049defbe38 has already been disposed. > >> >> > at > >> >> > > >> >> > > >> >> > org.apache.flink.runtime.jobmaster.JobMaster. > requestPartitionState(JobMaster.java:610) > >> >> > at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source) > >> >> > at > >> >> > > >> >> > > >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke( > DelegatingMethodAccessorImpl.java:43) > >> >> > at java.lang.reflect.Method.invoke(Method.java:498) > >> >> > at > >> >> > > >> >> > > >> >> > org.apache.flink.runtime.rpc.akka.AkkaRpcActor. > handleRpcInvocation(AkkaRpcActor.java:210) > >> >> > at > >> >> > > >> >> > > >> >> > org.apache.flink.runtime.rpc.akka.AkkaRpcActor. > handleMessage(AkkaRpcActor.java:154) > >> >> > at > >> >> > > >> >> > > >> >> > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor. > handleMessage(FencedAkkaRpcActor.java:69) > >> >> > at > >> >> > > >> >> > > >> >> > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$ > onReceive$1(AkkaRpcActor.java:132) > >> >> > at > >> >> > > >> >> > akka.actor.ActorCell$$anonfun$become$1.applyOrElse( > ActorCell.scala:544) > >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > >> >> > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:224) > >> >> > at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > >> >> > at > >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec( > ForkJoinTask.java:260) > >> >> > at > >> >> > > >> >> > > >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue. > runTask(ForkJoinPool.java:1339) > >> >> > at > >> >> > > >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker( > ForkJoinPool.java:1979) > >> >> > at > >> >> > > >> >> > > >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run( > ForkJoinWorkerThread.java:107) > >> >> > > >> >> > > >> >> > Thanks > >> >> > Amit > >> > > >> > > > > > >