And when is the 1.8.0 release expected?

On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> :) That makes so much more sense. Is k8s native flink a part of this release?
>
> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:
>
>> Hi Vishal,
>>
>> This issue was fixed recently [1], and the patch will be released with 1.8. If the Flink job gets cancelled, the JVM should exit with code 0. There is a release candidate [2], which you can test.
>>
>> Best,
>> Gary
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>
>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> Thanks Vijay,
>>>
>>> This is the larger issue. The cancellation routine is itself broken.
>>>
>>> On cancellation flink does remove the checkpoint counter
>>>
>>> 2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>
>>> but exits with a non-zero code
>>>
>>> 2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.
>>>
>>> That, I think, is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.
>>>
>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>
>>>> Yes Vishal. That's correct.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> This is really not cool, but here you go. This seems to work. Agreed that it cannot be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do?
>>>>>
>>>>> - Take a save point. This returns a request id:
>>>>>
>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>> - Make sure the save point succeeded:
>>>>>
>>>>> curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>
>>>>> - Cancel the job:
>>>>>
>>>>> curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>
>>>>> - Delete the job and deployment:
>>>>>
>>>>> kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>> kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>
>>>>> - Edit the job-cluster-job-deployment.yaml. Add/Edit:
>>>>>
>>>>> args: ["job-cluster",
>>>>>        "--fromSavepoint",
>>>>>        "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>        "--job-classname", .........
>>>>>
>>>>> - Restart:
>>>>>
>>>>> kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>> kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>
>>>>> - Make sure from the UI that it restored from the specific save point.
>>>>>
>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>
>>>>>> Yes, it's supposed to work. But unfortunately it was not working. The Flink community needs to respond to this behavior.
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>
>>>>>>> Aah.
>>>>>>> Let me try this out and will get back to you.
>>>>>>> Though I would assume that a save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation (else why would that be an option).
>>>>>>> Thanks again.
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> yarn-cancel is not meant only for yarn clusters. It works for all clusters. It's the recommended command.
>>>>>>>>
>>>>>>>> Use the following command to issue a save point:
>>>>>>>>
>>>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>
>>>>>>>> Then issue yarn-cancel.
>>>>>>>> After that, follow the process to restore the save point.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bhaskar
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello Vijay,
>>>>>>>>>
>>>>>>>>> Thank you for the reply. This though is a k8s deployment (rather than yarn), but maybe they follow the same lifecycle.
>>>>>>>>> I issue a *save point with cancel* as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight up
>>>>>>>>>
>>>>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>
>>>>>>>>> I would assume that after taking the save point, the jvm should exit; after all, the k8s deployment is of kind: job, and if it is a job cluster then a cancellation should exit the jvm and hence the pod. It does seem to do some things right. It stops a bunch of stuff (the JobMaster, the slotPool, the zookeeper coordinator etc). It also removes the checkpoint counter, but it does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).
>>>>>>>>>
>>>>>>>>> Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I have to delete the zk chroot for it to consider the save point.
>>>>>>>>>
>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job cluster deployment seems to be:
>>>>>>>>>
>>>>>>>>> - cancel with save point as defined here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>> - delete the job manager job and task manager deployments from k8s almost immediately
>>>>>>>>> - clear the ZK chroot for the 0000000...... job, and maybe the checkpoints directory
>>>>>>>>> - resumeFromCheckPoint
>>>>>>>>>
>>>>>>>>> Can somebody confirm that this is indeed the process?
>>>>>>>>>
>>>>>>>>> Logs are attached.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>>>>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>>>>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>>>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>>>>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>>>>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>
>>>>>>>>> After a little bit:
>>>>>>>>>
>>>>>>>>> Starting the job-cluster
>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>>>>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>>>>>>> ..
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vishal
>>>>>>>>>>
>>>>>>>>>> Save point with cancellation internally uses the /cancel REST API, which is not a stable API. It always exits with 404.
>>>>>>>>>> The best way to issue it is:
>>>>>>>>>>
>>>>>>>>>> a) First issue the save point REST API.
>>>>>>>>>> b) Then issue the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
>>>>>>>>>> c) Then, when resuming your job, provide the save point path returned by (a) as an argument to the run-jar REST API.
>>>>>>>>>>
>>>>>>>>>> The above is the smoother way.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Bhaskar
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> There are some issues I see and would want some feedback on:
>>>>>>>>>>>
>>>>>>>>>>> 1. On cancellation with SavePoint with a target directory, the k8s job does not exit (it is not a deployment). I would assume that on cancellation the jvm should exit, after cleanup etc, and thus the pod should too. That does not happen, and thus the job pod remains live. Is that expected?
>>>>>>>>>>>
>>>>>>>>>>> 2. To resume from a save point, it seems that I have to delete the job id (0000000000....) from ZooKeeper (this is HA), else it defaults to the latest checkpoint no matter what.
>>>>>>>>>>>
>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested process of cancelling with a save point and resuming, and what the cogent story is around the job id (defaults to 000000000000..). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Vishal.
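
[Editor's note] The workaround the thread converges on (trigger a savepoint without cancel-job, poll its status, cancel explicitly, then redeploy with --fromSavepoint) can be sketched roughly as below. This is a sketch, not the thread's exact commands: the endpoint, job id variable, and savepoint directory are placeholders to be replaced with your own values.

```shell
#!/usr/bin/env sh
# Hedged sketch of the savepoint-then-cancel workaround (Flink 1.7.x REST API).
# FLINK_REST and TARGET_DIR are hypothetical placeholders, not values from the thread.

FLINK_REST="${FLINK_REST:-https://flink.example.com}"
JOB_ID="00000000000000000000000000000000"      # default job id of a job cluster
TARGET_DIR="hdfs://namenode:8020/savepoints"   # placeholder HDFS path

# Compose the REST paths once so they can be checked in isolation.
savepoints_url() { printf '%s/jobs/%s/savepoints' "$1" "$2"; }
cancel_url()     { printf '%s/jobs/%s?mode=cancel' "$1" "$2"; }

# 1) Trigger a savepoint WITHOUT cancel-job (the combined cancel path was
#    reported unreliable above). The JSON response carries a request-id.
trigger_savepoint() {
  curl -s -H 'Content-Type: application/json' -X POST \
    -d "{\"target-directory\":\"$TARGET_DIR\",\"cancel-job\":false}" \
    "$(savepoints_url "$FLINK_REST" "$JOB_ID")"
}

# 2) Check the savepoint request status using the request-id from step 1.
savepoint_status() {
  curl -s "$(savepoints_url "$FLINK_REST" "$JOB_ID")/$1"
}

# 3) Cancel the job once the savepoint has completed.
cancel_job() {
  curl -s -X PATCH "$(cancel_url "$FLINK_REST" "$JOB_ID")"
}

# 4) Redeploy: delete the k8s job and task-manager deployment, add
#    "--fromSavepoint", "<savepoint path>" to the job-cluster args in the
#    manifest, then kubectl create both manifests again.
```

Usage would be: capture the request-id from `trigger_savepoint`, poll `savepoint_status` until the savepoint path appears, then `cancel_job` and redeploy.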