This is really not ideal, but here you go. This seems to work. Agreed that it should not be this painful. The cancel does not exit with an exit code of 0, so the job has to be deleted manually. Vijay, does this align with what you have had to do?
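For reference, here is a rough shell sketch of the savepoint / verify / cancel part of the flow below. The endpoint URL, namenode host, and savepoint directory are placeholders (our real values are redacted), and the `CURL` indirection is just there so the commands can be dry-run with `CURL=echo` before pointing them at a live cluster:

```shell
#!/bin/sh
# Sketch only: JOBMANAGER_URL and SAVEPOINT_DIR are hypothetical placeholders.
# Set CURL=echo to preview the commands instead of executing them.
CURL="${CURL:-curl}"
JOBMANAGER_URL="${JOBMANAGER_URL:-https://jobmanager.example.com}"
JOB_ID="00000000000000000000000000000000"   # standalone job clusters use the zero job id
SAVEPOINT_DIR="${SAVEPOINT_DIR:-hdfs://namenode:8020/tmp/savepoints}"

trigger_savepoint() {
  # POST with cancel-job:false; the JSON response carries a request id.
  "$CURL" --header "Content-Type: application/json" --request POST \
    --data "{\"target-directory\":\"${SAVEPOINT_DIR}\",\"cancel-job\":false}" \
    "${JOBMANAGER_URL}/jobs/${JOB_ID}/savepoints"
}

savepoint_status() {
  # GET the trigger status for the request id returned by trigger_savepoint.
  "$CURL" --request GET "${JOBMANAGER_URL}/jobs/${JOB_ID}/savepoints/$1"
}

cancel_job() {
  # PATCH with mode=cancel once the savepoint is confirmed complete.
  "$CURL" --request PATCH "${JOBMANAGER_URL}/jobs/${JOB_ID}?mode=cancel"
}
```

The kubectl delete/edit/create steps after that are the same as listed below; only the savepoint path in `--fromSavepoint` changes per run.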
- Take a savepoint. This returns a request id:

  curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints

- Make sure the savepoint succeeded:

  curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687

- Cancel the job:

  curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel

- Delete the job and the deployment:

  kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
  kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Edit job-cluster-job-deployment.yaml. Add/edit:

  args: ["job-cluster", "--fromSavepoint", "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22", "--job-classname", .........

- Restart:

  kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
  kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Make sure, from the UI, that it restored from the specific savepoint.

On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

> Yes, it's supposed to work, but unfortunately it was not working. The Flink
> community needs to respond to this behavior.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> Aah.
>> Let me try this out and will get back to you.
>> Though I would assume that save point with cancel is a single atomic
>> step, rather than a save point *followed* by a cancellation (else why
>> would that be an option).
>> Thanks again.
>>
>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>
>>> Hi Vishal,
>>>
>>> yarn-cancel isn't meant only for yarn clusters. It works for all
>>> clusters. It's the recommended command.
>>>
>>> Use the following command to issue a save point.
>>> curl --header "Content-Type: application/json" --request POST --data \
>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>
>>> Then issue yarn-cancel.
>>> After that follow the process to restore the save point.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> Hello Vijay,
>>>>
>>>> Thank you for the reply. This though is a k8s deployment (rather than
>>>> yarn), but maybe they follow the same lifecycle. I issue a *save point
>>>> with cancel* as documented here
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>> a straight up
>>>>
>>>> curl --header "Content-Type: application/json" --request POST --data \
>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>> I would assume that after taking the save point the jvm should exit;
>>>> after all, the k8s deployment is of kind: job, and if it is a job cluster
>>>> then a cancellation should exit the jvm and hence the pod. It does seem to
>>>> do some things right. It stops a bunch of stuff (the JobMaster, the
>>>> SlotPool, the zookeeper coordinator etc.). It also removes the checkpoint
>>>> counter but does not exit the job. And after a little bit the job is
>>>> restarted, which does not make sense and is absolutely not the right thing
>>>> to do (to me at least).
>>>>
>>>> Further, if I delete the deployment and the job from k8s and restart the
>>>> job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
>>>> have to delete the zk chroot for it to consider the save point.
>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>> cluster deployment seems to be:
>>>>
>>>> - cancel with save point, as defined here:
>>>>   https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>> - delete the job manager job and task manager deployments from k8s
>>>>   almost immediately.
>>>> - clear the ZK chroot for the 0000000...... job, and maybe the
>>>>   checkpoints directory.
>>>> - resumeFromCheckPoint
>>>>
>>>> Can somebody confirm that this is indeed the process?
>>>>
>>>> Logs are attached.
>>>>
>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>
>>>> After a little bit:
>>>>
>>>> Starting the job-cluster
>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>> ..
>>>> ..
>>>>
>>>> Regards.
>>>>
>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>>> Hi Vishal
>>>>>
>>>>> Save point with cancellation internally uses the /cancel REST API, which
>>>>> is not a stable API. It always exits with 404.
>>>>> The best way to issue it is:
>>>>>
>>>>> a) First issue the save point REST API
>>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E
>>>>> )
>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>> by (a) as an argument to the run jar REST API
>>>>>
>>>>> The above is the smoother way.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> There are some issues I see and would want to get some feedback on.
>>>>>>
>>>>>> 1. On Cancellation With SavePoint with a Target Directory, the k8s
>>>>>> job does not exit (it is not a deployment). I would assume that on
>>>>>> cancellation the jvm should exit, after cleanup etc., and thus the pod
>>>>>> should too. That does not happen, and thus the job pod remains live.
>>>>>> Is that expected?
>>>>>>
>>>>>> 2. To resume from a save point, it seems that I have to delete the job
>>>>>> id (0000000000....) from ZooKeeper (this is HA), else it defaults to
>>>>>> the latest checkpoint no matter what.
>>>>>>
>>>>>> I am kind of curious as to what in 1.7.2 is the tested process of
>>>>>> cancelling with a save point and resuming, and what the cogent story is
>>>>>> around the job id (defaults to 000000000000..). Note that --job-id does
>>>>>> not work with 1.7.2, so even though that does not make sense, I still
>>>>>> cannot provide a new job id.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Vishal.