Yes Vishal, that's correct.

Regards,
Bhaskar
On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> This is really not cool, but here you go. This seems to work; agreed that it cannot be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do?
>
> - Take a save point. This returns a request id:
>
> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints
>
> - Make sure the save point succeeded:
>
> curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>
> - Cancel the job:
>
> curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>
> - Delete the job and deployment:
>
> kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
> kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
> - Edit job-cluster-job-deployment.yaml. Add/edit:
>
> args: ["job-cluster",
>        "--fromSavepoint",
>        "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>        "--job-classname", .........
>
> - Restart:
>
> kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
> kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
> - Make sure, from the UI, that it restored from the specific save point.
>
> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
>> Yes, it's supposed to work, but unfortunately it was not working. The Flink community needs to respond to this behavior.
>>
>> Regards,
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> Aah. Let me try this out and will get back to you.
>>> Though I would assume that a save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation (else why would that be an option). Thanks again.
>>>
>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> yarn-cancel is not meant only for YARN clusters; it works for all clusters and is the recommended command.
>>>>
>>>> Use the following command to issue a save point:
>>>>
>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>> Then issue yarn-cancel. After that, follow the process to restore the save point.
>>>>
>>>> Regards,
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> Hello Vijay,
>>>>>
>>>>> Thank you for the reply. This, though, is a k8s deployment (rather than YARN), but maybe they follow the same lifecycle. I issue a *save point with cancel* as documented at https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight-up:
>>>>>
>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>> I would assume that after taking the save point, the JVM should exit; after all, the k8s deployment is of kind: Job, and if it is a job cluster, then a cancellation should exit the JVM and hence the pod. It does seem to do some things right. It stops a bunch of stuff (the JobMaster, the SlotPool, the ZooKeeper coordinator, etc.).
>>>>> It also removes the checkpoint counter, but does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).
>>>>>
>>>>> Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavepoint, it refuses to honor the fromSavepoint. I have to delete the ZK chroot for it to consider the save point.
>>>>>
>>>>> Thus the process of cancelling and resuming from a save point on a k8s job cluster deployment seems to be:
>>>>>
>>>>> - cancel with save point as defined here: https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>> - delete the job manager job and task manager deployments from k8s almost immediately.
>>>>> - clear the ZK chroot for the 0000000...... job, and maybe the checkpoints directory.
>>>>> - resume fromSavepoint.
>>>>>
>>>>> Can somebody confirm that this is indeed the process?
>>>>>
>>>>> Logs are attached.
>>>>>
>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>
>>>>> After a little bit:
>>>>>
>>>>> Starting the job-cluster
>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>>> ..
>>>>> ..
>>>>>
>>>>> Regards.
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>
>>>>>> Hi Vishal,
>>>>>>
>>>>>> Save point with cancellation internally uses the /cancel REST API, which is not a stable API; it always exits with 404. The best way to issue it is:
>>>>>>
>>>>>> a) First issue the save point REST API.
>>>>>> b) Then issue the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
>>>>>> c) Then, when resuming your job, provide the save point path returned by (a) as an argument to the run-jar REST API.
>>>>>>
>>>>>> The above is the smoother way.
>>>>>>
>>>>>> Regards,
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>
>>>>>>> There are some issues I see and would want to get some feedback.
>>>>>>>
>>>>>>> 1. On cancellation with save point with a target directory, the k8s Job does not exit (it is not a Deployment). I would assume that on cancellation the JVM should exit, after cleanup etc., and thus the pod should too. That does not happen, and thus the job pod remains live. Is that expected?
>>>>>>>
>>>>>>> 2. To resume from a save point, it seems that I have to delete the job id (0000000000....) from ZooKeeper (this is HA), else it defaults to the latest checkpoint no matter what.
>>>>>>>
>>>>>>> I am kind of curious what the tested process in 1.7.2 is for cancelling with a save point and resuming, and what the cogent story is around the job id (it defaults to 000000000000..). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Vishal.
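[Editorial addendum] Pulling the steps in this thread together, the savepoint/cancel/resume cycle for a Flink 1.7 k8s job cluster can be sketched as the shell script below. This is only an illustration, not something from the thread itself: the host, manifest paths, and helper names (`request_id`, `status_url`, `zk_chroot`) are invented placeholders, the REST paths follow the 1.7 monitoring API linked above, and the ZooKeeper cleanup step reflects the behavior observed in the thread (HA state must be cleared or --fromSavepoint is ignored) rather than documented procedure.

```shell
#!/bin/sh
# Sketch of the savepoint -> cancel -> clear-HA -> resume cycle discussed
# above. All hosts, paths, and helper names are placeholders.

BASE="https://flink-rest.example.com"        # hypothetical REST endpoint
JOB_ID="00000000000000000000000000000000"    # fixed job id of a job cluster

# Pull the "request-id" out of the savepoint trigger response with a naive
# JSON scrape (jq would be more robust, if available).
request_id() {
  printf '%s' "$1" | sed -n 's/.*"request-id" *: *"\([^"]*\)".*/\1/p'
}

# URL used to poll for savepoint completion ("make sure the save point
# succeeded" in the thread).
status_url() {
  printf '%s/jobs/%s/savepoints/%s' "$BASE" "$JOB_ID" "$1"
}

# ZooKeeper chroot holding the HA state for this cluster, built from
# high-availability.zookeeper.path.root ($1) and the cluster id ($2).
zk_chroot() {
  printf '%s/%s' "$1" "$2"
}

# 1. Trigger a savepoint without cancelling (the two-step workaround):
#      resp=$(curl -s -H 'Content-Type: application/json' -X POST \
#        -d '{"target-directory":"hdfs://nn:8020/tmp/sp","cancel-job":false}' \
#        "$BASE/jobs/$JOB_ID/savepoints")
# 2. Poll "$(status_url "$(request_id "$resp")")" until it reports success.
# 3. Cancel:    curl -s -X PATCH "$BASE/jobs/$JOB_ID?mode=cancel"
# 4. Tear down: kubectl delete -f job-cluster-job-deployment.yaml
#               kubectl delete -f task-manager-deployment.yaml
# 5. Clear HA state so --fromSavepoint is honored ("rmr" on ZooKeeper 3.4,
#    "deleteall" on 3.5+):
#      zkCli.sh -server zk:2181 rmr "$(zk_chroot /k8s_anomalyecho k8s_anomalyecho)"
# 6. Add --fromSavepoint <savepoint path> to the job args, then:
#               kubectl create -f job-cluster-job-deployment.yaml
#               kubectl create -f task-manager-deployment.yaml
```

The `zk_chroot` arguments above mirror the `/k8s_anomalyecho/k8s_anomalyecho` prefix visible in the attached logs; substitute whatever root and cluster id your flink-conf.yaml configures.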