Awesome, thanks!

On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <g...@ververica.com> wrote:
The RC artifacts are only deployed to the Maven Central Repository when the RC is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you can find the maven artifacts and the Flink binaries here:

- https://repository.apache.org/content/repositories/orgapacheflink-1210/
- https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/

Alternatively, you can apply the patch yourself and build Flink 1.7 from sources [2]. On my machine this takes around 10 minutes if tests are skipped.

Best,
Gary

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink

On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Do you have a Maven repository (at Maven Central) set up for the 1.8 release candidate? We could test it for you.

Without 1.8 and this exit code we are essentially held up.

On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote:

Nobody can tell with 100% certainty. We want to give the RC some exposure first, and there is also a release process that is prescribed by the ASF [1]. You can look at past releases to get a feeling for how long the release process lasts [2].

[1] http://www.apache.org/legal/release-policy.html#release-approval
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0

On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

And when is the 1.8.0 release expected?

On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

:) That makes so much more sense. Is k8s-native Flink a part of this release?

On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:

Hi Vishal,

This issue was fixed recently [1], and the patch will be released with 1.8. If the Flink job gets cancelled, the JVM should exit with code 0. There is a release candidate [2], which you can test.

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10743
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html

On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Thanks Vijay,

This is the larger issue: the cancellation routine is itself broken.

On cancellation, Flink does remove the checkpoint counter

2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

but exits with a non-zero code

2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.

That, I think, is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.
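For what it is worth, a quick way to confirm what the entrypoint container actually terminated with is to read the pod status after cancellation. A minimal sketch; the label selector app=anomalyecho is an assumption from our own manifests, not something Flink sets:

# Print the exit code the job-cluster container terminated with (placeholder label selector).
kubectl get pod -l app=anomalyecho -o jsonpath='{.items[0].status.containerStatuses[0].state.terminated.exitCode}'

A pod backing a k8s Job only counts as succeeded when that value is 0, so the 1444 above is exactly what keeps the Job from being marked complete.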
On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes Vishal. That's correct.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

This is really not cool, but here you go. This seems to work. Agreed that it should not be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do?

- Take a save point. This returns a request id.

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints

- Make sure the save point succeeded.

curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687

- Cancel the job.

curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel

- Delete the job and deployment.

kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Edit job-cluster-job-deployment.yaml. Add/edit:

args: ["job-cluster", "--fromSavepoint", "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22", "--job-classname", .........

- Restart.

kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Make sure, from the UI, that it restored from the specific save point.

On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes, it's supposed to work. But unfortunately it was not working. The Flink community needs to respond to this behavior.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Aah.
Let me try this out and will get back to you.
Though I would assume that save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation (else why would that be an option).
Thanks again.

On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

yarn-cancel is not meant only for YARN clusters. It works for all clusters. It is the recommended command.

Use the following command to issue a save point:
curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

Then issue yarn-cancel.
After that, follow the process to restore the save point.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Hello Vijay,

Thank you for the reply. This, though, is a k8s deployment (rather than YARN), but maybe they follow the same lifecycle. I issue a *save point with cancel* as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight up

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

I would assume that after taking the save point, the JVM should exit; after all, the k8s deployment is of kind: Job, and if it is a job cluster then a cancellation should exit the JVM and hence the pod. It does seem to do some things right. It stops a bunch of stuff (the JobMaster, the SlotPool, the ZooKeeper coordinator etc.). It also removes the checkpoint counter, but it does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).

Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavepoint, it refuses to honor the fromSavepoint. I have to delete the ZK chroot for it to consider the save point.

Thus the process of cancelling and resuming from a save point on a k8s job cluster deployment seems to be:

- cancel with save point as defined here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
- delete the job manager job and task manager deployments from k8s almost immediately
- clear the ZK chroot for the 0000000...... job, and maybe the checkpoints directory
- resume with --fromSavepoint

Can somebody confirm that this is indeed the process?

Logs are attached.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.

2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).

2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.

2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.

2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down

2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.

2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.

2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.

2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.

After a little bit:

Starting the job-cluster
used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
Starting standalonejob as a console application on host anomalyecho-mmg6t.
..
..

Regards.

On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

Save point with cancellation internally uses the /cancel REST API, which is not a stable API. It always exits with 404.
The best way to issue it is:

a) First issue the save point REST API.
b) Then issue the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
c) Then, when resuming your job, provide the save point path (returned by (a)) as an argument to the run-jar REST API.

The above is the smoother way.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

There are some issues I see and would want to get some feedback on.

1. On cancellation with save point with a target directory, the k8s Job does not exit (it is not a Deployment). I would assume that on cancellation the JVM should exit, after cleanup etc., and thus the pod should too. That does not happen, and thus the job pod remains live. Is that expected?

2. To resume from a save point, it seems that I have to delete the job id (0000000000....) from ZooKeeper (this is HA), else it defaults to the latest checkpoint no matter what.

I am kind of curious as to what in 1.7.2 is the tested process of cancelling with a save point and resuming, and what is the cogent story around the job id (which defaults to 000000000000..). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.

Regards,

Vishal.
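For concreteness, this is roughly the sequence I am running today on 1.7.2. It is only a sketch of what I do, not a documented procedure; the REST host, HDFS path, and ZooKeeper host are placeholders for our setup, and the zkCli invocation is illustrative:

# 1. Cancel with savepoint via the 1.7 REST API (target directory is a placeholder).
curl --header "Content-Type: application/json" --request POST \
  --data '{"target-directory":"hdfs://<namenode>:8020/tmp/savepoints","cancel-job":true}' \
  https://<jobmanager-rest>/jobs/00000000000000000000000000000000/savepoints

# 2. The entrypoint pod does not exit, so remove the k8s Job and the task manager Deployment by hand.
kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

# 3. Clear the HA state under our ZooKeeper chroot (the chroot is the one visible in the logs above),
#    otherwise --fromSavepoint is ignored in favor of the latest checkpoint.
zkCli.sh -server <zk-host>:2181 rmr /k8s_anomalyecho/k8s_anomalyecho

# 4. Redeploy with --fromSavepoint in the job-cluster args pointing at the savepoint path from step 1.
kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

If there is a shorter path than this, I would love to know.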