Oh yeah, this is indeed the larger issue :)


> This is the larger issue.  The cancellation routine is itself broken.
> On cancellation Flink does remove the checkpoint counter
> *2019-03-12 14:12:13,143
> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
> Removing /checkpoint-counter/00000000000000000000000000000000 from
> ZooKeeper *
> but exits with a non-zero code
> *2019-03-12 14:12:13,477
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
> exit code 1444.*
> That I think is an issue. A cancelled job is a complete job and thus the
> exit code should be 0 for k8s to mark it complete.
>> Yes, Vishal. That's correct.
>>> This is really not cool, but here you go. This seems to work. Agreed that
>>> it should not be this painful. The cancel does not exit with an exit code of
>>> 0, and thus the job has to be deleted manually. Vijay, does this align with
>>> what you have had to do?
>>>    - Take a save point. This returns a request id.
>>>    curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints
>>>    - Make sure the save point succeeded.
>>>    curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>    - Cancel the job.
>>>    curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>    - Delete the job and deployment.
>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>    args: ["job-cluster",
>>>                   "--fromSavepoint",
>>>                   "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>                   "--job-classname", .........
>>>    - Restart
>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>    - Make sure from the UI that it restored from the specific save
>>>    point.
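
For anyone who wants to script the first three steps of the list quoted above (take the save point, confirm it, then cancel), here is a minimal Python sketch using only the standard library. It is a sketch under assumptions, not a tested tool: the function names `http_json` and `savepoint_then_cancel` are made up for illustration, and the response fields it reads (`request-id`, `status.id`, `operation.location`, `operation.failure-cause`) are what I recall from the 1.7 REST API docs, so please check them against your cluster. It does not touch the kubectl or ZooKeeper steps.

```python
import json
import time
import urllib.request


def http_json(method, url, body=None):
    """Send one JSON request with the standard library and decode the reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else {}


def savepoint_then_cancel(base_url, job_id, target_dir, call=http_json,
                          poll_seconds=2.0, max_polls=60):
    """Trigger a save point, wait for it, then cancel; return the save point path.

    Hypothetical helper: `call` is injectable so the flow can be exercised
    without a running cluster.
    """
    jobs = f"{base_url}/jobs/{job_id}"

    # Step 1: trigger the save point without cancelling; the reply carries a request id.
    trigger = call("POST", f"{jobs}/savepoints",
                   {"target-directory": target_dir, "cancel-job": False})
    request_id = trigger["request-id"]  # assumed field name

    # Step 2: poll until the save point is reported as completed.
    for _ in range(max_polls):
        status = call("GET", f"{jobs}/savepoints/{request_id}")
        if status["status"]["id"] == "COMPLETED":  # assumed field names
            operation = status.get("operation", {})
            if "failure-cause" in operation:
                raise RuntimeError(f"savepoint failed: {operation['failure-cause']}")
            location = operation["location"]
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("savepoint did not complete in time")

    # Step 3: cancel only after the save point is known to exist.
    call("PATCH", f"{jobs}?mode=cancel")
    return location
```

The returned path is what would go into the `--fromSavepoint` argument when editing the job manifest. Keeping the cancel as a separate call after the poll is deliberate: it mirrors the manual steps above rather than relying on `"cancel-job": true`.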
>>>> Yes, it's supposed to work, but unfortunately it was not working. The Flink
>>>> community needs to respond to this behavior.
>>>>> Aah.
>>>>> Let me try this out and will get back to you.
>>>>> Though I would assume that save point with cancel is a single atomic
>>>>> step, rather than a save point *followed* by a cancellation (else
>>>>> why would that be an option).
>>>>> Thanks again.
>>>>>> Hi Vishal,
>>>>>> yarn-cancel is not meant only for YARN clusters. It works for all
>>>>>> clusters. It's the recommended command.
>>>>>> Use the following command to issue the save point.
>>>>>>  curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>> Then issue yarn-cancel.
>>>>>> After that, follow the process to restore from the save point.
>>>>>>> Hello Vijay,
>>>>>>> Thank you for the reply. This, though, is a k8s
>>>>>>> deployment (rather than YARN), but maybe they follow the same
>>>>>>> lifecycle.
>>>>>>> I issue a *save point with cancel* as documented here
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>> a straight up
>>>>>>>  curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>> I would assume that after taking the save point, the jvm should
>>>>>>> exit. After all, the k8s deployment is of kind: job, and if it is a job
>>>>>>> cluster then a cancellation should exit the jvm and hence the pod. It
>>>>>>> does seem to do some things right. It stops a bunch of stuff (the
>>>>>>> JobMaster, the SlotPool, the ZooKeeper coordinator etc.). It also removes
>>>>>>> the checkpoint counter but does not exit the job. And after a little bit
>>>>>>> the job is restarted, which does not make sense and is absolutely not the
>>>>>>> right thing to do (to me at least).
>>>>>>> Further, if I delete the deployment and the job from k8s and restart
>>>>>>> the job and deployment fromSavePoint, it refuses to honor the
>>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the
>>>>>>> save point.
>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>>>> cluster deployment seems to be
>>>>>>>    - cancel with save point as defined here
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>    - delete the job manager job and task manager deployments from
>>>>>>>    k8s almost immediately.
>>>>>>>    - clear the ZK chroot for the 0000000...... job and maybe the
>>>>>>>    checkpoints directory.
>>>>>>>    - resumeFromCheckPoint
>>>>>>> Can somebody confirm that this is indeed the process?
>>>>>>>  Logs are attached.
>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>> Savepoint stored in
>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>>>> anomaly_echo (00000000000000000000000000000000) switched from state 
>>>>>>> RUNNING
>>>>>>> to CANCELLING.
>>>>>>> 2019-03-12 08:10:44,227 INFO  
>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>     - Completed checkpoint 10 for job
>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink 
>>>>>>> (1/1)
>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink 
>>>>>>> (1/1)
>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>>>> anomaly_echo (00000000000000000000000000000000) switched from state
>>>>>>> 2019-03-12 08:10:44,276 INFO  
>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>> 00000000000000000000000000000000.
>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>> - Shutting down
>>>>>>> 2019-03-12 08:10:44,323 INFO  
>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>       - Checkpoint with ID 8 at
>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>> discarded.
>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>> - Removing
>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>> from ZooKeeper
>>>>>>> 2019-03-12 08:10:44,437 INFO  
>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>       - Checkpoint with ID 10 at
>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>> discarded.
>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>> Shutting down.
>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from 
>>>>>>> ZooKeeper
>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>>>>>>> 00000000000000000000000000000000 reached globally terminal state 
>>>>>>> CANCELED.
>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>> Stopping the JobMaster for job
>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>> 2019-03-12 08:10:44,468 INFO  
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>> Shutting down rest endpoint.
>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>> /leader/resource_manager_lock.
>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>> JobManager is shutting down..
>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>> Suspending SlotPool.
>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>> Stopping SlotPool.
>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>>>>>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>> After a little bit
>>>>>>> Starting the job-cluster
>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>>>> `jobmanager.heap.size`
>>>>>>> Starting standalonejob as a console application on host
>>>>>>> anomalyecho-mmg6t.
>>>>>>> ..
>>>>>>> ..
>>>>>>>> Hi Vishal,
>>>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>>>> which is not a stable API. It always exits with 404. The best way to
>>>>>>>> do this is:
>>>>>>>> a) First issue the save point REST API call.
>>>>>>>> b) Then issue the /yarn-cancel REST API call (as described in
>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E
>>>>>>>> ).
>>>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>>>> by (a) as an argument to the run jar REST API.
>>>>>>>> The above is the smoother way.
>>>>>>>>> There are some issues I see and would like some feedback on.
>>>>>>>>> 1. On cancellation with save point with a target directory, the
>>>>>>>>> k8s job does not exit (it is not a deployment). I would assume that
>>>>>>>>> on cancellation the jvm should exit, after cleanup etc., and thus the
>>>>>>>>> pod should too. That does not happen and thus the job pod remains
>>>>>>>>> live. Is that expected?
>>>>>>>>> 2. To resume from a save point it seems that I have to delete the
>>>>>>>>> job id (0000000000....) from ZooKeeper (this is HA), else it
>>>>>>>>> defaults to the latest checkpoint no matter what.
>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested process of
>>>>>>>>> cancelling with a save point and resuming, and what is the cogent
>>>>>>>>> story around job id (defaults to 000000000000..). Note that --job-id
>>>>>>>>> does not work with 1.7.2, so even though that does not make sense, I
>>>>>>>>> still cannot provide a new job id.
