Nobody can tell with 100% certainty. We want to give the RC some exposure first, and there is also a release process that is prescribed by the ASF [1]. You can look at past releases to get a feeling for how long the release process lasts [2].
[1] http://www.apache.org/legal/release-policy.html#release-approval
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0

On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> And when is the 1.8.0 release expected ?
>
> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> :) That makes so much more sense. Is k8s native Flink a part of this release ?
>>
>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:
>>
>>> Hi Vishal,
>>>
>>> This issue was fixed recently [1], and the patch will be released with 1.8. If the Flink job gets cancelled, the JVM should exit with code 0. There is a release candidate [2], which you can test.
>>>
>>> Best,
>>> Gary
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>
>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> Thanks Vijay,
>>>>
>>>> This is the larger issue. The cancellation routine is itself broken.
>>>>
>>>> On cancellation Flink does remove the checkpoint counter
>>>>
>>>> *2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper*
>>>>
>>>> but exits with a non-zero code
>>>>
>>>> *2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.*
>>>>
>>>> That I think is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.
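The exit-code point above maps directly onto how Kubernetes judges a Job pod: code 0 means "Completed", anything else means the pod failed. A minimal sketch of that rule, assuming the 1444 from the log above (the helper function is hypothetical, not part of Flink or kubectl):

```shell
# Hypothetical helper: how Kubernetes would classify a Job pod given the
# container's exit code. k8s treats 0 as success; any other code as failure.
job_pod_status() {
  if [ "$1" -eq 0 ]; then
    echo "Completed"
  else
    echo "Error"
  fi
}

# A cancelled job exiting with 1444, as in the log above, is therefore
# marked failed even though the cancellation itself succeeded.
job_pod_status 0      # prints: Completed
job_pod_status 1444   # prints: Error
```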
>>>>
>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>>> Yes Vishal. That's correct.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> This is really not cool, but here you go. This seems to work. Agreed that this cannot be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do ?
>>>>>>
>>>>>> - Take a save point. This returns a request id
>>>>>>
>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>
>>>>>> - Make sure the save point succeeded
>>>>>>
>>>>>> curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>
>>>>>> - Cancel the job
>>>>>>
>>>>>> curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>
>>>>>> - Delete the job and deployment
>>>>>>
>>>>>> kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>> kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>
>>>>>> - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>
>>>>>> args: ["job-cluster", "--fromSavepoint", "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22", "--job-classname", .........
>>>>>>
>>>>>> - Restart
>>>>>>
>>>>>> kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>> kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>
>>>>>> - Make sure, from the UI, that it restored from the specific save point.
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, it's supposed to work. But unfortunately it was not working. The Flink community needs to respond to this behavior.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Aah.
>>>>>>>> Let me try this out and will get back to you.
>>>>>>>> Though I would assume that save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation ( else why would that be an option ).
>>>>>>>> Thanks again.
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vishal,
>>>>>>>>>
>>>>>>>>> yarn-cancel isn't meant only for YARN clusters. It works for all clusters. It's the recommended command.
>>>>>>>>>
>>>>>>>>> Use the following command to issue a save point:
>>>>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1", "cancel-job":false}' \ https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>
>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>> After that, follow the process to restore the save point.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Vijay,
>>>>>>>>>>
>>>>>>>>>> Thank you for the reply.
>>>>>>>>>> This though is a k8s deployment ( rather than YARN ), but maybe they follow the same lifecycle.
>>>>>>>>>> I issue a *save point with cancel* as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight up
>>>>>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \ https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>
>>>>>>>>>> I would assume that after taking the save point the JVM should exit; after all, the k8s deployment is of kind: Job, and if it is a job cluster then a cancellation should exit the JVM and hence the pod. It does seem to do some things right. It stops a bunch of components ( the JobMaster, the SlotPool, the ZooKeeper coordinator etc ). It also removes the checkpoint counter but does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do ( to me at least ).
>>>>>>>>>>
>>>>>>>>>> Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I have to delete the ZK chroot for it to consider the save point.
>>>>>>>>>>
>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job cluster deployment seems to be
>>>>>>>>>>
>>>>>>>>>> - cancel with save point as defined here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>> - delete the job manager job and task manager deployments from k8s almost immediately.
>>>>>>>>>> - clear the ZK chroot for the 0000000...... job and maybe the checkpoints directory.
>>>>>>>>>> - resumeFromCheckPoint
>>>>>>>>>>
>>>>>>>>>> Can somebody confirm that this indeed is the process ?
>>>>>>>>>>
>>>>>>>>>> Logs are attached.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>>>>>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>>>>>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>>>>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>>>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>>>>>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>
>>>>>>>>>> After a little bit
>>>>>>>>>>
>>>>>>>>>> Starting the job-cluster
>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>>>>>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>>>>>>>> ..
>>>>>>>>>> ..
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>
>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST API, which is not a stable API; it always exits with 404. The best way to issue it is:
>>>>>>>>>>>
>>>>>>>>>>> a) First issue the save point REST API
>>>>>>>>>>> b) Then issue the /yarn-cancel REST API ( as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E )
>>>>>>>>>>> c) Then, when resuming your job, provide the save point path returned by (a) as an argument to the run-jar REST API
>>>>>>>>>>> The above is the smoother way.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Bhaskar
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>>>>>>
>>>>>>>>>>>> 1. On cancellation with save point with a target directory, the k8s job does not exit ( it is not a deployment ). I would assume that on cancellation the JVM should exit, after cleanup etc, and thus the pod should too. That does not happen and thus the job pod remains live. Is that expected ?
>>>>>>>>>>>>
>>>>>>>>>>>> 2.
>>>>>>>>>>>> To resume from a save point, it seems that I have to delete the job id ( 0000000000.... ) from ZooKeeper ( this is HA ), else it defaults to the latest checkpoint no matter what.
>>>>>>>>>>>>
>>>>>>>>>>>> I am kind of curious as to what, in 1.7.2, is the tested process of cancelling with a save point and resuming, and what is the cogent story around the job id ( defaults to 000000000000.. ). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still can not provide a new job id.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Vishal.
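The savepoint/cancel/redeploy cycle that emerges from this thread can be collected into one hedged shell sketch. The REST endpoints and JSON field names ("request-id", "IN_PROGRESS", "location") follow the Flink 1.7 REST API; JM_URL, the savepoint directory, and the manifest paths are placeholders, and the JSON scraping via sed is a naive stand-in for a real JSON parser:

```shell
#!/bin/sh
# Sketch of the cycle discussed above: trigger a savepoint, wait for it,
# cancel the job, then redeploy the job cluster with --fromSavepoint.
JM_URL="${JM_URL:-https://jobmanager.example.com}"   # placeholder
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"

trigger_savepoint() {  # POST /jobs/:jobid/savepoints -> {"request-id":"..."}
  curl -s -H "Content-Type: application/json" -X POST \
    -d "{\"target-directory\":\"$1\",\"cancel-job\":false}" \
    "$JM_URL/jobs/$JOB_ID/savepoints"
}

wait_for_savepoint() {  # poll GET /jobs/:jobid/savepoints/:triggerid
  while :; do
    body=$(curl -s "$JM_URL/jobs/$JOB_ID/savepoints/$1")
    case "$body" in
      *IN_PROGRESS*) sleep 2 ;;
      *) # completed: print the savepoint location (naive JSON scrape)
         echo "$body" | sed -n 's/.*"location":"\([^"]*\)".*/\1/p'
         return ;;
    esac
  done
}

cancel_job() {  # PATCH /jobs/:jobid?mode=cancel
  curl -s -X PATCH "$JM_URL/jobs/$JOB_ID?mode=cancel"
}

redeploy_from_savepoint() {  # tear down, then recreate pointing at "$1"
  kubectl delete -f manifests/job-cluster-job-deployment.yaml
  kubectl delete -f manifests/task-manager-deployment.yaml
  # Edit the job-cluster manifest first, e.g.:
  #   args: ["job-cluster", "--fromSavepoint", "$1", "--job-classname", ...]
  kubectl create -f manifests/job-cluster-job-deployment.yaml
  kubectl create -f manifests/task-manager-deployment.yaml
}
```

As the thread notes for 1.7.2, the ZK chroot for the fixed job id may also need to be cleared between teardown and redeploy, or the restored job falls back to the latest checkpoint.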