Here you go https://issues.apache.org/jira/browse/FLINK-12333
Again, thanks for the prompt response.

On Wed, Apr 24, 2019 at 11:06 AM Till Rohrmann <trohrm...@apache.org> wrote:

Good to hear. Could you create a documentation JIRA issue for this problem? Thanks a lot.

Cheers,
Till

On Wed, Apr 24, 2019 at 4:58 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Verified. I think we just need to make sure that it is documented :)

On Wed, Apr 24, 2019 at 9:47 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

This makes total sense and is actually smart (defensive). Will test and report. I do think, though, that this needs to be documented :)

On Wed, Apr 24, 2019 at 6:03 AM Till Rohrmann <trohrm...@apache.org> wrote:

Hi Vishal,

it seems that the following is happening: you triggered the cancel-with-savepoint command via the REST call. This command is an asynchronous operation which produces a result (the savepoint path). In order to deliver asynchronous results to the caller, Flink waits before shutting down until they are delivered or until it times out after 5 minutes. I assume that you don't request the savepoint path from Flink via the URL returned by the original request. This could happen either if you kill the CLI before it is done or if you have written your own method to trigger this operation.

I guess we could add a flag for asynchronous operations which tells Flink that their results don't need to be delivered to a client. If you would like to have such a feature, then please open a JIRA issue for it.

Cheers,
Till
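For reference, a minimal sketch of the trigger-and-poll flow described above, assuming the /jobs/:jobid/savepoints trigger and status endpoints of the Flink REST API and Java 11's java.net.http client; the host (localhost:8081), the job id, the target directory, and the crude JSON handling are placeholders for illustration, not a definitive client.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CancelWithSavepointSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders: point these at your own cluster, job, and savepoint directory.
        String base = "http://localhost:8081";
        String jobId = "00000000000000000000000000000000";
        String body = "{\"target-directory\": \"hdfs:///flink/savepoints\", \"cancel-job\": true}";

        HttpClient client = HttpClient.newHttpClient();

        // 1) Trigger the asynchronous cancel-with-savepoint operation.
        HttpRequest trigger = HttpRequest.newBuilder()
                .uri(URI.create(base + "/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        String triggerResponse = client.send(trigger, HttpResponse.BodyHandlers.ofString()).body();
        // The response carries a request id, e.g. {"request-id":"..."}; a real client would
        // use a JSON parser instead of this crude regex extraction.
        String requestId = triggerResponse.replaceAll(".*\"request-id\"\\s*:\\s*\"([^\"]+)\".*", "$1");

        // 2) Poll the status endpoint until the result (the savepoint path) has been delivered.
        //    Skipping this step leaves the result undelivered, and the cluster entrypoint then
        //    waits for it (the ~5 minute timeout seen in the logs) before shutting down.
        while (true) {
            HttpRequest poll = HttpRequest.newBuilder()
                    .uri(URI.create(base + "/jobs/" + jobId + "/savepoints/" + requestId))
                    .GET()
                    .build();
            String status = client.send(poll, HttpResponse.BodyHandlers.ofString()).body();
            System.out.println(status);
            if (!status.contains("IN_PROGRESS")) {
                break; // COMPLETED: the body contains the savepoint location (or a failure cause).
            }
            Thread.sleep(2000);
        }
    }
}

Presumably this trigger-then-poll exchange is also what the CLI performs when it is left running until the savepoint completes, which is why killing it early leaves the result undelivered.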
On Wed, Apr 24, 2019 at 3:49 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Anyone?

I think there is some race condition. These are the TM logs. I am puzzled because a larger pipeline (there are about 32 slots on 8 replicas) works.

2019-04-24 01:16:20,889 DEBUG org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Releasing local state under allocation id 5a853ef886e1c599f86b9503306fffd2.
2019-04-24 01:16:20,894 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 00000000000000000000000000000000.
org.apache.flink.util.FlinkException: Stopping JobMaster for job EventCountJob(00000000000000000000000000000000).
    at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:355)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:504)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:170)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-04-24 01:16:20,895 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job 00000000000000000000000000000000 because it is not registered.
2019-04-24 01:16:21,053 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from e61c2b7d992f151936e21db1ca06666d.
2019-04-24 01:16:22,136 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x25add5478fb2ec6 after 0ms
2019-04-24 01:16:31,052 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from e61c2b7d992f151936e21db1ca06666d.
2019-04-24 01:16:35,483 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x25add5478fb2ec6 after 0ms

On Tue, Apr 23, 2019 at 3:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

I see this in the TM pod:

2019-04-23 19:08:41,828 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x15cc7f3d88466a5 after 0ms
2019-04-23 19:08:47,543 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from 6b7dd7b5032c089bff8a77f75de65c22.
2019-04-23 19:08:55,175 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x15cc7f3d88466a5 after 1ms
2019-04-23 19:08:57,548 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from 6b7dd7b5032c089bff8a77f75de65c22.
2019-04-23 19:09:07,543 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from 6b7dd7b5032c089bff8a77f75de65c22.
2019-04-23 19:09:08,523 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x15cc7f3d88466a5 after 0ms
2019-04-23 19:09:17,542 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from 6b7dd7b5032c089bff8a77f75de65c22.
2019-04-23 19:09:21,871 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x15cc7f3d88466a5 after 0ms
2019-04-23 19:09:27,543 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from 6b7dd7b5032c089bff8a77f75de65c22.
2019-04-23 19:09:35,218 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x15cc7f3d88466a5 after 0ms
2019-04-23 19:09:37,542 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat request from 6b7dd7b5032c089bff8a77f75de65c22.

The JM log has analogous entries:

2019-04-23 19:10:49,218 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x25add5478fb2e7c after 0ms

Does that ring a bell?

On Tue, Apr 23, 2019 at 2:04 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Adding the DEBUG logs from the time I call a REST-based cancel with savepoint...

On Tue, Apr 23, 2019 at 2:01 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Though looking at https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/StandaloneResourceManager.java#L88 it does seem that the last log line is expected.

Not sure what part is hanging... I have more logs I can share...

On Tue, Apr 23, 2019 at 1:59 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

I am seeing this weird issue where I do a savepoint with cancel on a job on k8s, and it hangs for 5 minutes (no INFO logs) and then exits with exit code 2.

2019-04-23 17:36:31,372 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
2019-04-23 17:36:31,374 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-04-23 17:36:31,377 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Suspending SlotPool.
2019-04-23 17:36:31,378 DEBUG org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 181a4fd61044033a2ea32e384096247f.
org.apache.flink.util.FlinkException: JobManager is shutting down.
    at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:365)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:504)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:170)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-04-23 17:36:31,381 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Stopping SlotPool.
2019-04-23 17:36:31,381 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a41a5dceae5ad3664ff1f0b79f3e47ef@akka.tcp://flink@kafka-to-prometheus:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
2019-04-23 17:36:31,385 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.

and after 5 minutes...

2019-04-23 17:41:32,187 DEBUG org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 8 thread-local buffer(s) from thread: Finalizer
2019-04-23 17:41:32,198 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
2019-04-23 17:41:32,200 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 2.
java.util.concurrent.TimeoutException
    at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:942)
    at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:360)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

In the interim, I get this at a regular clip:

2019-04-23 17:37:02,452 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release TaskExecutor 3752235c49428b94a0520f04266973eb because it exceeded the idle timeout.
2019-04-23 17:37:02,453 DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 68c5fbd67ac2bbe6fc35ed068ce1c4b1 could not be stopped.

Any ideas as to whether it is this inability to shut down the Worker that is causing this issue?

Regards.