Oh yeah, this is a larger issue indeed :)

Regards
Bhaskar
On Tue, Mar 12, 2019 at 7:51 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Thanks Vijay,

This is the larger issue. The cancellation routine itself is broken.

On cancellation, Flink does remove the checkpoint counter:

2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

but exits with a non-zero code:

2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.

That, I think, is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.

On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes Vishal, that's correct.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

This is really not cool, but here you go. This seems to work. Agreed that it cannot be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do?

- Take a save point.
This returns a request id:

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints

- Make sure the save point succeeded:

curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687

- Cancel the job:

curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel

- Delete the job and deployment:

kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Edit job-cluster-job-deployment.yaml. Add/edit:

args: ["job-cluster",
       "--fromSavepoint",
       "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
       "--job-classname", .........

- Restart:

kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Make sure, from the UI, that it restored from the specific save point.

On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes, it's supposed to work, but unfortunately it was not working. The Flink community needs to respond to this behavior.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Aah. Let me try this out and will get back to you. Though I would assume that save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation (else why would that be an option). Thanks again.
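[Editor's note] The manual save-point / cancel / redeploy cycle described above could be scripted end to end. This is only a sketch: the ingress host, HDFS target directory, and manifest paths are placeholders standing in for the masked values in the thread, and every command is printed rather than executed so it can be reviewed first.

```shell
#!/usr/bin/env sh
# Sketch of the save-point -> cancel -> redeploy cycle from the thread.
# All hosts and paths below are placeholders; substitute your own.
FLINK_API="https://flink.example.com"          # hypothetical ingress host
JOB_ID="00000000000000000000000000000000"      # job-cluster default job id
SP_DIR="hdfs://namenode:8020/tmp/savepoints"   # hypothetical target dir

run() { echo "+ $*"; }  # dry-run: print the command; use "$@" to execute

# 1. Trigger a save point without cancelling (returns a trigger/request id).
run curl --header "Content-Type: application/json" --request POST \
    --data "{\"target-directory\":\"$SP_DIR\",\"cancel-job\":false}" \
    "$FLINK_API/jobs/$JOB_ID/savepoints"

# 2. Poll the save-point status with the trigger id from step 1.
TRIGGER_ID="<trigger-id-from-step-1>"
run curl --request GET "$FLINK_API/jobs/$JOB_ID/savepoints/$TRIGGER_ID"

# 3. Cancel the job.
run curl --request PATCH "$FLINK_API/jobs/$JOB_ID?mode=cancel"

# 4. Tear down and recreate the k8s job and deployment (manifest names
#    are placeholders), after editing in --fromSavepoint <path from step 2>.
run kubectl delete -f manifests/job-cluster-job-deployment.yaml
run kubectl delete -f manifests/task-manager-deployment.yaml
run kubectl create -f manifests/job-cluster-job-deployment.yaml
run kubectl create -f manifests/task-manager-deployment.yaml
```

Switching `run` from `echo` to real execution turns the dry run into the actual procedure, one reviewed step at a time.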
On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

yarn-cancel isn't meant only for YARN clusters; it works for all clusters. It's the recommended command.

Use the following command to issue a save point:

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

Then issue yarn-cancel. After that, follow the process to restore the save point.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Hello Vijay,

Thank you for the reply. This though is a k8s deployment (rather than YARN), but maybe they follow the same lifecycle. I issue a *save point with cancel* as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight-up

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

I would assume that after taking the save point, the JVM should exit; after all, the k8s deployment is of kind: job, and if it is a job cluster then a cancellation should exit the JVM and hence the pod. It does seem to do some things right. It stops a bunch of stuff (the JobMaster, the SlotPool, the ZooKeeper coordinator, etc.). It also removes the checkpoint counter but does not exit the job.
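[Editor's note] The two-step flow Vijay recommends (trigger the save point without `cancel-job`, then cancel separately) can be captured in a small helper. This is a sketch under assumptions: the host is a placeholder, and the `/jobs/:jobid/yarn-cancel` path is the undocumented endpoint referenced in the thread he links, so verify it against your Flink version.

```shell
#!/usr/bin/env sh
# Sketch: compose the save-point trigger body, then show the separate
# cancel call. Host and target directory are placeholders.
FLINK_API="https://flink.example.com"
JOB_ID="00000000000000000000000000000000"

savepoint_request() {
  # JSON body for the save-point trigger; cancel-job stays false because
  # cancellation is issued as its own step afterwards.
  printf '{"target-directory":"%s","cancel-job":false}' "$1"
}

echo "POST $FLINK_API/jobs/$JOB_ID/savepoints"
echo "body: $(savepoint_request "hdfs://namenode:8020/tmp/savepoints")"
# The undocumented yarn-cancel endpoint from the linked archive thread:
echo "GET  $FLINK_API/jobs/$JOB_ID/yarn-cancel"
```

Keeping `cancel-job` false decouples the save point from cancellation, which is exactly the workaround being discussed for the broken atomic cancel-with-savepoint.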
And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).

Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I have to delete the ZK chroot for it to consider the save point.

Thus the process of cancelling and resuming from a SP on a k8s job cluster deployment seems to be:

- Cancel with save point as defined here: https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
- Delete the job manager job and task manager deployments from k8s almost immediately.
- Clear the ZK chroot for the 0000000...... job, and maybe the checkpoints directory.
- resumeFromCheckPoint

Can somebody confirm that this is indeed the process?

Logs are attached.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.

After a little bit:

Starting the job-cluster
used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
Starting standalonejob as a console application on host anomalyecho-mmg6t.
..
..

Regards.

On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

Save point with cancellation internally uses the /cancel REST API, which is not a stable API. It always exits with 404.
The best way to issue it is:

a) First issue the save point REST API.
b) Then issue the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
c) Then, when resuming your job, provide the save point path returned by (a) as an argument to the run-jar REST API.

The above is the smoother way.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

There are some issues I see and would want to get some feedback.

1. On Cancellation With SavePoint with a target directory, the k8s job does not exit (it is not a deployment). I would assume that on cancellation the JVM should exit, after cleanup etc., and thus the pod should too. That does not happen, and thus the job pod remains live. Is that expected?

2. To resume from a save point, it seems that I have to delete the job id (0000000000....) from ZooKeeper (this is HA), else it defaults to the latest checkpoint no matter what.

I am kind of curious as to what in 1.7.2 is the tested process of cancelling with a save point and resuming, and what the cogent story is around the job id (defaults to 000000000000..). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.

Regards,
Vishal.
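[Editor's note] Step (c) of the a/b/c flow needs the save-point path out of the status response before it can be fed back as --fromSavepoint. A sketch of that extraction follows; the JSON field names (`status.id`, `operation.location`) follow the Flink 1.7 async-operation REST responses, and the sample response here is fabricated for illustration.

```shell
#!/usr/bin/env sh
# Sketch: pull the save-point location out of a completed save-point status
# response, then show it plugged into the job-cluster args.
extract_location() {
  # Crude stdin extraction of "location":"..." from the status JSON;
  # with jq available, `jq -r '.operation.location'` is more robust.
  sed -n 's/.*"location":"\([^"]*\)".*/\1/p'
}

# Illustrative response for a completed save point (not real output).
STATUS='{"status":{"id":"COMPLETED"},"operation":{"location":"hdfs://namenode:8020/tmp/savepoints/savepoint-000000-1d4f71345e22"}}'

SP_PATH=$(printf '%s' "$STATUS" | extract_location)
echo "args: [\"job-cluster\", \"--fromSavepoint\", \"$SP_PATH\", ...]"
```

The extracted path is what goes into the edited job-cluster-job-deployment.yaml before recreating the k8s job, per the procedure earlier in the thread.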