BTW, does 1.8 also solve the issue where we cancel with a save point? That too is broken in 1.7.2.
curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":true}' https://*************/jobs/00000000000000000000000000000000/savepoints

On Tue, Mar 12, 2019 at 11:55 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Awesome, thanks!

On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <g...@ververica.com> wrote:

The RC artifacts are only deployed to the Maven Central Repository when the RC is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you can find the maven artifacts and the Flink binaries here:

- https://repository.apache.org/content/repositories/orgapacheflink-1210/
- https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/

Alternatively, you can apply the patch yourself and build Flink 1.7 from sources [2]. On my machine this takes around 10 minutes if tests are skipped.

Best,
Gary

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink
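If anyone wants the build-from-source route in [2], a rough sketch; the commit to pick up is whatever the FLINK-10743 fix mentioned further down landed as, so treat it as a placeholder:

git clone https://github.com/apache/flink.git
cd flink
git checkout release-1.7                   # the 1.7 line
git cherry-pick <commit-for-FLINK-10743>   # placeholder for the fix referenced below
mvn clean install -DskipTests              # roughly 10 minutes without tests
# The resulting distribution typically ends up under flink-dist/target/flink-*-bin/.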
On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Do you have a mvn repository ( at mvn central ) set up for the 1.8 release candidate? We could test it for you.

Without 1.8 and this exit code we are essentially held up.

On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote:

Nobody can tell with 100% certainty. We want to give the RC some exposure first, and there is also a release process that is prescribed by the ASF [1]. You can look at past releases to get a feeling for how long the release process lasts [2].

[1] http://www.apache.org/legal/release-policy.html#release-approval
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0

On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

And when is the 1.8.0 release expected ?

On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

:) That makes so much more sense. Is k8s native flink a part of this release ?

On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:

Hi Vishal,

This issue was fixed recently [1], and the patch will be released with 1.8. If the Flink job gets cancelled, the JVM should exit with code 0. There is a release candidate [2], which you can test.

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10743
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html

On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Thanks Vijay,

This is the larger issue. The cancellation routine is itself broken.

On cancellation flink does remove the checkpoint counter

2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

but exits with a non-zero code

2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.

That I think is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.

On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes Vishal. Thats correct.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

This is really not cool, but here you go. This seems to work. Agreed that it cannot be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do?

- Take a save point. This returns a request id

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints

- Make sure the save point succeeded

curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687

- Cancel the job

curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel

- Delete the job and deployment

kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Edit the job-cluster-job-deployment.yaml. Add/Edit

args: ["job-cluster",
       "--fromSavepoint",
       "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
       "--job-classname", .........

- Restart

kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Make sure from the UI that it restored from the specific save point.
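For reference, that sequence can be strung together as one script. This is only a sketch mirroring the steps above: the REST endpoint, HDFS path, and manifest names are placeholders from this thread, it assumes jq is available, and it leaves the --fromSavepoint edit of the manifest as a manual step.

#!/usr/bin/env bash
set -euo pipefail

FLINK_REST="https://<flink-rest-endpoint>"    # placeholder for the redacted host
JOB_ID="00000000000000000000000000000000"
TARGET_DIR="hdfs://nn-crunchy:8020/tmp/xyz14"

# 1. Trigger a savepoint without cancelling (cancel-job:true is what misbehaves on 1.7.2).
TRIGGER_ID=$(curl -s --header "Content-Type: application/json" --request POST \
  --data "{\"target-directory\":\"${TARGET_DIR}\",\"cancel-job\":false}" \
  "${FLINK_REST}/jobs/${JOB_ID}/savepoints" | jq -r '."request-id"')

# 2. Poll the trigger until it is COMPLETED, then grab the savepoint location.
until [ "$(curl -s "${FLINK_REST}/jobs/${JOB_ID}/savepoints/${TRIGGER_ID}" | jq -r '.status.id')" = "COMPLETED" ]; do
  sleep 5
done
SAVEPOINT=$(curl -s "${FLINK_REST}/jobs/${JOB_ID}/savepoints/${TRIGGER_ID}" | jq -r '.operation.location')

# 3. Cancel the job.
curl -s --request PATCH "${FLINK_REST}/jobs/${JOB_ID}?mode=cancel"

# 4. Tear down the job-cluster Job and the task-manager Deployment.
kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

# 5. Edit job-cluster-job-deployment.yaml so the entrypoint args include
#    "--fromSavepoint", "${SAVEPOINT}" (done out of band in this sketch),
#    then recreate both resources.
kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

echo "Resumed from ${SAVEPOINT}; verify in the Flink UI."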
On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes, it is supposed to work. But unfortunately it was not working. The Flink community needs to respond to this behavior.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Aah.

Let me try this out and will get back to you.

Though I would assume that save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation ( else why would that be an option ).

Thanks again.

On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

yarn-cancel is not meant only for a yarn cluster. It works for all clusters. It is the recommended command.

Use the following command to issue a save point:

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1", "cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

Then issue yarn-cancel. After that, follow the process to restore the save point.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Hello Vijay,

Thank you for the reply. This though is a k8s deployment ( rather than yarn ), but maybe they follow the same lifecycle. I issue a save point with cancel as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight up

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

I would assume that after taking the save point, the jvm should exit; after all, the k8s deployment is of kind: job, and if it is a job cluster then a cancellation should exit the jvm and hence the pod. It does seem to do some things right. It stops a bunch of stuff ( the JobMaster, the SlotPool, the zookeeper coordinator etc ). It also removes the checkpoint counter but does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do ( to me at least ).

Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavepoint, it refuses to honor the fromSavepoint. I have to delete the zk chroot for it to consider the save point.
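( Clearing that chroot can be done with the ZooKeeper CLI; this is only a sketch, with the host as a placeholder and the znode path mirroring the logs further down. The exact paths depend on high-availability.zookeeper.path.root and high-availability.cluster-id. )

# Wipe the HA znodes for this job cluster so the next deploy honors
# --fromSavepoint instead of recovering the latest checkpoint from ZooKeeper.
bin/zkCli.sh -server <zk-host>:2181 <<'EOF'
deleteall /k8s_anomalyecho
quit
EOF
# On ZooKeeper 3.4.x the command is "rmr /k8s_anomalyecho" instead of "deleteall".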
Thus the process of cancelling and resuming from a SP on a k8s job cluster deployment seems to be:

- cancel with save point as defined here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
- delete the job manager job and task manager deployments from k8s almost immediately.
- clear the ZK chroot for the 0000000...... job and maybe the checkpoints directory.
- resume with --fromSavepoint

Can somebody confirm that this is indeed the process?

Logs are attached.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.

2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).

2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.

2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.

2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down

2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.

2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.

2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.

2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
After a little bit:

Starting the job-cluster

used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`

Starting standalonejob as a console application on host anomalyecho-mmg6t.

..

..

Regards.

On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

Save point with cancellation internally uses the /cancel REST API, which is not a stable API. It always exits with 404. The best way to issue it is:

a) First issue the save point REST API
b) Then issue the /yarn-cancel REST API ( as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E )
c) Then, when resuming your job, provide the save point path returned by (a) as an argument to the run-jar REST API

Above is the smoother way.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

There are some issues I see and would want to get some feedback on.

1. On cancellation with save point with a target directory, the k8s job does not exit ( it is not a deployment ). I would assume that on cancellation the jvm should exit, after cleanup etc, and thus the pod should too. That does not happen, and thus the job pod remains live. Is that expected ?

2. To resume from a save point it seems that I have to delete the job id ( 0000000000.... ) from ZooKeeper ( this is HA ), else it defaults to the latest checkpoint no matter what.

I am kind of curious as to what in 1.7.2 is the tested process of cancelling with a save point and resuming, and what is the cogent story around the job id ( defaults to 000000000000.. ). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.

Regards,

Vishal.
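( For context on issue 1: a standalone job cluster on k8s is typically submitted as kind: Job, so whether the pod is retried or marked complete depends on the container's exit code. A quick way to see what happened after a cancel; the pod name and label below are placeholders: )

# List the job-cluster pods (label is a placeholder for whatever the manifest sets).
kubectl get pods -l app=anomalyecho

# Exit code of the last terminated job-manager container; check
# .state.terminated.exitCode instead if the Job created a fresh pod rather
# than restarting the container. A non-zero code ( e.g. the 1444 logged above )
# is what makes the Job start the container again "after a little bit".
kubectl get pod <job-cluster-pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'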