:) That makes so much more sense. Is k8s-native Flink part of this release?
On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:

> Hi Vishal,
>
> This issue was fixed recently [1], and the patch will be released with 1.8.
> If the Flink job gets cancelled, the JVM should exit with code 0. There is
> a release candidate [2], which you can test.
>
> Best,
> Gary
>
> [1] https://issues.apache.org/jira/browse/FLINK-10743
> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>
> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> Thanks Vijay,
>>
>> This is the larger issue: the cancellation routine itself is broken.
>>
>> On cancellation, Flink does remove the checkpoint counter:
>>
>> 2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>
>> but exits with a non-zero code:
>>
>> 2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.
>>
>> That, I think, is an issue. A cancelled job is a completed job, and thus
>> the exit code should be 0 for k8s to mark it complete.
>>
>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>
>>> Yes Vishal, that's correct.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> This is really not cool, but here you go. This seems to work. Agreed
>>>> that it should not be this painful. The cancel does not exit with an
>>>> exit code of 0, and thus the job has to be deleted manually. Vijay,
>>>> does this align with what you have had to do?
>>>>
>>>> - Take a savepoint. This returns a request id:
>>>>
>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>> - Make sure the savepoint succeeded:
>>>>
>>>> curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>
>>>> - Cancel the job:
>>>>
>>>> curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>
>>>> - Delete the job and deployment:
>>>>
>>>> kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>> kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>
>>>> - Edit job-cluster-job-deployment.yaml. Add/edit:
>>>>
>>>> args: ["job-cluster",
>>>>        "--fromSavepoint",
>>>>        "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>        "--job-classname", .........
>>>>
>>>> - Restart:
>>>>
>>>> kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>> kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>
>>>> - Make sure, from the UI, that it restored from the specific savepoint.
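>>>>
>>>> For reference, the same sequence as a single shell sketch (the API host
>>>> is a placeholder, the HDFS path and manifest names are the ones from
>>>> this thread, and using jq to pull the request id out of the trigger
>>>> response is an assumption):
>>>>
>>>> #!/usr/bin/env bash
>>>> set -euo pipefail
>>>> API="https://<jobmanager-host>"   # placeholder endpoint
>>>> JOB="00000000000000000000000000000000"
>>>> MANIFESTS="manifests/bf2-PRODUCTION"
>>>>
>>>> # 1. Trigger a savepoint while the job keeps running.
>>>> REQ=$(curl -s -H "Content-Type: application/json" -X POST \
>>>>   -d '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' \
>>>>   "$API/jobs/$JOB/savepoints" | jq -r '."request-id"')
>>>>
>>>> # 2. Confirm the savepoint completed before cancelling.
>>>> curl -s "$API/jobs/$JOB/savepoints/$REQ"
>>>>
>>>> # 3. Cancel the job.
>>>> curl -s -X PATCH "$API/jobs/$JOB?mode=cancel"
>>>>
>>>> # 4. Tear down, point the entrypoint at the savepoint, and recreate.
>>>> kubectl delete -f "$MANIFESTS/job-cluster-job-deployment.yaml"
>>>> kubectl delete -f "$MANIFESTS/task-manager-deployment.yaml"
>>>> # (edit job-cluster-job-deployment.yaml: add "--fromSavepoint", "<savepoint path>")
>>>> kubectl create -f "$MANIFESTS/job-cluster-job-deployment.yaml"
>>>> kubectl create -f "$MANIFESTS/task-manager-deployment.yaml"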
>>>>
>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>>> Yes, it is supposed to work. But unfortunately it was not working.
>>>>> The Flink community needs to respond to this behavior.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> Aah.
>>>>>> Let me try this out and will get back to you.
>>>>>> Though I would assume that savepoint-with-cancel is a single atomic
>>>>>> step, rather than a savepoint *followed* by a cancellation (else why
>>>>>> would that be an option).
>>>>>> Thanks again.
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vishal,
>>>>>>>
>>>>>>> yarn-cancel isn't meant only for YARN clusters; it works for all
>>>>>>> clusters. It's the recommended command.
>>>>>>>
>>>>>>> Use the following command to issue a savepoint:
>>>>>>>
>>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1", "cancel-job":false}' \ https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>
>>>>>>> Then issue yarn-cancel.
>>>>>>> After that, follow the process to restore the savepoint.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello Vijay,
>>>>>>>>
>>>>>>>> Thank you for the reply. This, though, is a k8s deployment (rather
>>>>>>>> than YARN), but maybe they follow the same lifecycle.
>>>>>>>> I issue a *savepoint with cancel* as documented here
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>> a straight-up
>>>>>>>>
>>>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \ https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>
>>>>>>>> I would assume that after taking the savepoint the JVM should exit;
>>>>>>>> after all, the k8s deployment is of kind: Job, and if it is a job
>>>>>>>> cluster, then a cancellation should exit the JVM and hence the pod.
>>>>>>>> It does seem to do some things right. It stops a bunch of stuff
>>>>>>>> (the JobMaster, the SlotPool, the ZooKeeper coordinator, etc.). It
>>>>>>>> also removes the checkpoint counter, but it does not exit the job.
>>>>>>>> And after a little bit the job is restarted, which does not make
>>>>>>>> sense and is absolutely not the right thing to do (to me at least).
>>>>>>>>
>>>>>>>> Further, if I delete the deployment and the job from k8s and
>>>>>>>> restart the job and deployment fromSavePoint, it refuses to honor
>>>>>>>> the fromSavePoint. I have to delete the ZK chroot for it to
>>>>>>>> consider the savepoint.
>>>>>>>>
>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>>>>> cluster deployment seems to be:
>>>>>>>>
>>>>>>>> - cancel with savepoint, as defined here:
>>>>>>>>   https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>> - delete the job manager job and task manager deployments from k8s
>>>>>>>>   almost immediately.
>>>>>>>> - clear the ZK chroot for the 0000000...... job, and maybe the
>>>>>>>>   checkpoints directory (a sketch of this step follows below).
>>>>>>>> - resumeFromCheckPoint.
>>>>>>>>
>>>>>>>> Can somebody confirm that this is indeed the process?
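>>>>>>>>
>>>>>>>> For the chroot step, something like the following (a sketch only;
>>>>>>>> the ZK host is a placeholder, the two paths are the ones the logs
>>>>>>>> below report removing, and on ZooKeeper 3.5+ the command is
>>>>>>>> deleteall rather than rmr):
>>>>>>>>
>>>>>>>> zkCli.sh -server <zk-host>:2181
>>>>>>>> # from the zkCli prompt, recursively delete the job's state handles
>>>>>>>> rmr /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>> rmr /checkpoint-counter/00000000000000000000000000000000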
>>>>>>>>
>>>>>>>> Logs are attached.
>>>>>>>>
>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>>>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>>>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>>>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>>>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>
>>>>>>>> After a little bit:
>>>>>>>>
>>>>>>>> Starting the job-cluster
>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>>>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>>>>>> ..
>>>>>>>> ..
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vishal,
>>>>>>>>>
>>>>>>>>> Savepoint with cancellation internally uses the /cancel REST API,
>>>>>>>>> which is not a stable API. It always exits with 404. The best way
>>>>>>>>> to issue it is:
>>>>>>>>>
>>>>>>>>> a) First issue the savepoint REST API.
>>>>>>>>> b) Then issue the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
>>>>>>>>> c) Then, when resuming your job, provide the savepoint path returned by (a) as an argument to the run-jar REST API.
>>>>>>>>>
>>>>>>>>> The above is the smoother way.
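>>>>>>>>>
>>>>>>>>> A sketch of chaining (a) into (c) — the request-id/status shape
>>>>>>>>> follows the 1.7 asynchronous-operation REST API, but the host, the
>>>>>>>>> HDFS path, and the use of jq here are assumptions:
>>>>>>>>>
>>>>>>>>> API="https://<jobmanager-host>"; JOB="<job id>"   # placeholders
>>>>>>>>> REQ=$(curl -s -H "Content-Type: application/json" -X POST \
>>>>>>>>>   -d '{"target-directory":"hdfs://<namenode>:8020/tmp/savepoints","cancel-job":false}' \
>>>>>>>>>   "$API/jobs/$JOB/savepoints" | jq -r '."request-id"')
>>>>>>>>> # Poll until COMPLETED, then keep the savepoint path for step (c).
>>>>>>>>> until [ "$(curl -s "$API/jobs/$JOB/savepoints/$REQ" | jq -r '.status.id')" = "COMPLETED" ]; do
>>>>>>>>>   sleep 2
>>>>>>>>> done
>>>>>>>>> SAVEPOINT=$(curl -s "$API/jobs/$JOB/savepoints/$REQ" | jq -r '.operation.location')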
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> There are some issues I see and would want to get some feedback on.
>>>>>>>>>>
>>>>>>>>>> 1. On cancellation with savepoint with a target directory, the
>>>>>>>>>> k8s job does not exit (it is not a deployment). I would assume
>>>>>>>>>> that on cancellation the JVM should exit after cleanup etc., and
>>>>>>>>>> thus the pod should too. That does not happen, and so the job pod
>>>>>>>>>> remains live. Is that expected?
>>>>>>>>>>
>>>>>>>>>> 2. To resume from a savepoint, it seems that I have to delete the
>>>>>>>>>> job id (0000000000....) from ZooKeeper (this is HA), else it
>>>>>>>>>> defaults to the latest checkpoint no matter what.
>>>>>>>>>>
>>>>>>>>>> I am kind of curious what the tested process in 1.7.2 is for
>>>>>>>>>> cancelling with a savepoint and resuming, and what the cogent
>>>>>>>>>> story is around the job id (it defaults to 000000000000..). Note
>>>>>>>>>> that --job-id does not work with 1.7.2, so even though that does
>>>>>>>>>> not make sense, I still cannot provide a new job id.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Vishal.