Hi Vishal,

yarn-cancel isn't meant only for YARN clusters; it works for all clusters. It is the recommended command.
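For concreteness, a minimal sketch of what "issue yarn-cancel" means against the REST endpoint, assuming the legacy GET /jobs/<job-id>/yarn-cancel handler described in the mail-archive link quoted further down (verify the exact path against your Flink version):

    # yarn-cancel is a plain GET, so it also works through proxies that only allow GET requests
    curl --request GET https://<host>/jobs/<job-id>/yarn-cancel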
Use the following command to issue the save point:

curl --header "Content-Type: application/json" --request POST \
  --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
  https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

Then issue yarn-cancel. After that, follow the process to restore the save point.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> Hello Vijay,
>
> Thank you for the reply. This though is a k8s deployment (rather than yarn), but maybe they follow the same lifecycle. I issue a *save point with cancel* as documented here
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
> a straight up
>
> curl --header "Content-Type: application/json" --request POST \
>   --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>   https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>
> I would assume that after taking the save point, the jvm should exit; after all, the k8s deployment is of kind: job, and if it is a job cluster then a cancellation should exit the jvm and hence the pod. It does seem to do some things right. It stops a bunch of stuff (the JobMaster, the SlotPool, the zookeeper coordinator etc.). It also removes the checkpoint counter, but it does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).
>
> Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I have to delete the zk chroot for it to consider the save point.
>
> Thus the process of cancelling and resuming from a SP on a k8s job cluster deployment seems to be
>
> - cancel with save point as defined here
>   https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
> - delete the job manager job and task manager deployments from k8s almost immediately.
> - clear the ZK chroot for the 0000000...... job and maybe the checkpoints directory.
> - resume fromSavePoint
>
> Can somebody confirm that this is indeed the process?
>
> Logs are attached.
>
> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>
> After a little bit:
>
> Starting the job-cluster
> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
> Starting standalonejob as a console application on host anomalyecho-mmg6t.
> ..
> ..
>
> Regards.
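For the "clear the ZK chroot" step in the quoted message above, a minimal sketch with the ZooKeeper CLI. The node paths are simply the ones visible in the logs above and depend on the configured high-availability.zookeeper.path.root and high-availability.cluster-id; on ZooKeeper 3.5+ use deleteall instead of rmr:

    # open a shell against the ZooKeeper ensemble that backs Flink HA
    bin/zkCli.sh -server <zk-host>:2181

    # inside the shell, recursively remove the HA state of the cancelled job (example paths from the logs)
    rmr /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
    rmr /checkpoint-counter/00000000000000000000000000000000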
> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
>> Hi Vishal
>>
>> Save point with cancellation internally uses the /cancel REST API, which is not a stable API. It always exits with 404. The best way to issue it is:
>>
>> a) First issue the save point REST API
>> b) Then issue the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E)
>> c) Then, when resuming your job, provide the save point path returned by (a) as an argument to the run-jar REST API
>>
>> The above is the smoother way.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> There are some issues I see and would want to get some feedback on.
>>>
>>> 1. On Cancellation With SavePoint with a Target Directory, the k8s job does not exit (it is not a deployment). I would assume that on cancellation the jvm should exit, after cleanup etc., and thus the pod should too. That does not happen and thus the job pod remains live. Is that expected?
>>>
>>> 2. To resume from a save point it seems that I have to delete the job id (0000000000....) from ZooKeeper (this is HA), else it defaults to the latest checkpoint no matter what.
>>>
>>> I am kind of curious as to what in 1.7.2 is the tested process of cancelling with a save point and resuming, and what the cogent story is around the job id (defaults to 000000000000..). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.
>>>
>>> Regards,
>>>
>>> Vishal.
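For steps (a) and (c) in Bhaskar's quoted message above, a rough sketch of how the pieces fit together, assuming the Flink 1.7 REST endpoint layout. The trigger id, jar id, host, and the JSON field names savepointPath / allowNonRestoredState are assumptions here and should be verified against your Flink version:

    # (a) POST /jobs/<job-id>/savepoints returns {"request-id":"<trigger-id>"};
    #     poll it until the status is COMPLETED and note the savepoint path in the response
    curl --request GET https://<host>/jobs/<job-id>/savepoints/<trigger-id>

    # (c) resubmit the job from that savepoint via the run-jar API (session clusters)
    curl --header "Content-Type: application/json" --request POST \
      --data '{"savepointPath":"hdfs://<namenode>:8020/tmp/xyz1/savepoint-xxxxxx-xxxxxxxxxxxx","allowNonRestoredState":false}' \
      https://<host>/jars/<jar-id>/run

For the k8s job-cluster deployment discussed in this thread there is no jar upload; the savepoint path is instead handed to the job-cluster entrypoint when the deployment is restarted (the fromSavePoint argument Vishal mentions above).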