Awesome, thanks!

On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <g...@ververica.com> wrote:
The RC artifacts are only deployed to the Maven Central Repository when the RC is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you can find the maven artifacts and the Flink binaries here:

- https://repository.apache.org/content/repositories/orgapacheflink-1210/
- https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/

Alternatively, you can apply the patch yourself and build Flink 1.7 from sources [2]. On my machine this takes around 10 minutes if tests are skipped.

Best,
Gary

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink

On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Do you have a Maven repository (at Maven Central) set up for the 1.8 release candidate? We could test it for you.

Without 1.8 and this exit code we are essentially held up.

On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote:

Nobody can tell with 100% certainty. We want to give the RC some exposure first, and there is also a release process that is prescribed by the ASF [1]. You can look at past releases to get a feeling for how long the release process lasts [2].

[1] http://www.apache.org/legal/release-policy.html#release-approval
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0

On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

And when is the 1.8.0 release expected?

On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

:) That makes so much more sense. Is k8s-native Flink a part of this release?

On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:

Hi Vishal,

This issue was fixed recently [1], and the patch will be released with 1.8. If the Flink job gets cancelled, the JVM should exit with code 0. There is a release candidate [2], which you can test.

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10743
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html

On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Thanks Vijay,

This is the larger issue: the cancellation routine is itself broken.

On cancellation, Flink does remove the checkpoint counter

2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

but exits with a non-zero code

2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.

That, I think, is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.
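For what it is worth, a quick way to confirm what the entrypoint container actually terminated with is to read the pod status after cancellation. A minimal sketch; the label selector app=anomalyecho is an assumption from our own manifests, not something Flink sets:

# Print the exit code the job-cluster container terminated with (placeholder label selector).
kubectl get pod -l app=anomalyecho -o jsonpath='{.items[0].status.containerStatuses[0].state.terminated.exitCode}'

A pod backing a k8s Job only counts as succeeded when that value is 0, so the 1444 above is exactly what keeps the Job from being marked complete.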
On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes Vishal. That's correct.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

This is really not cool, but here you go. This seems to work. Agreed that it should not be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do?

- Take a save point. This returns a request id.

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints

- Make sure the save point succeeded.

curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687

- Cancel the job.

curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel

- Delete the job and deployment.

kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Edit job-cluster-job-deployment.yaml. Add/edit:

args: ["job-cluster", "--fromSavepoint", "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22", "--job-classname", .........

- Restart.

kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Make sure, from the UI, that it restored from the specific save point.

On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes, it's supposed to work. But unfortunately it was not working. The Flink community needs to respond to this behavior.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Aah.
Let me try this out and will get back to you.
Though I would assume that save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation (else why would that be an option).
Thanks again.

On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

yarn-cancel is not meant only for YARN clusters. It works for all clusters. It is the recommended command.

Use the following command to issue a save point:
curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

Then issue yarn-cancel.
After that, follow the process to restore the save point.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Hello Vijay,

Thank you for the reply. This, though, is a k8s deployment (rather than YARN), but maybe they follow the same lifecycle. I issue a *save point with cancel* as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight up

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

I would assume that after taking the save point, the JVM should exit; after all, the k8s deployment is of kind: Job, and if it is a job cluster then a cancellation should exit the JVM and hence the pod. It does seem to do some things right. It stops a bunch of stuff (the JobMaster, the SlotPool, the ZooKeeper coordinator etc.). It also removes the checkpoint counter, but it does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).

Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavepoint, it refuses to honor the fromSavepoint. I have to delete the ZK chroot for it to consider the save point.

Thus the process of cancelling and resuming from a save point on a k8s job cluster deployment seems to be:

- cancel with save point as defined here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
- delete the job manager job and task manager deployments from k8s almost immediately
- clear the ZK chroot for the 0000000...... job, and maybe the checkpoints directory
- resume with --fromSavepoint

Can somebody confirm that this is indeed the process?

Logs are attached.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.

2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).

2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.

2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.

2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down

2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.

2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.

2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.

2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.

After a little bit:

Starting the job-cluster
used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
Starting standalonejob as a console application on host anomalyecho-mmg6t.
..
..

Regards.

On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

Save point with cancellation internally uses the /cancel REST API, which is not a stable API. It always exits with 404.
The best way to issue it is:

a) First issue the save point REST API.
b) Then issue the /yarn-cancel REST API (as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
c) Then, when resuming your job, provide the save point path (returned by (a)) as an argument to the run-jar REST API.

The above is the smoother way.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

There are some issues I see and would want to get some feedback on.

1. On cancellation with save point with a target directory, the k8s Job does not exit (it is not a Deployment). I would assume that on cancellation the JVM should exit, after cleanup etc., and thus the pod should too. That does not happen, and thus the job pod remains live. Is that expected?

2. To resume from a save point, it seems that I have to delete the job id (0000000000....) from ZooKeeper (this is HA), else it defaults to the latest checkpoint no matter what.

I am kind of curious as to what in 1.7.2 is the tested process of cancelling with a save point and resuming, and what is the cogent story around the job id (which defaults to 000000000000..). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.

Regards,

Vishal.
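For concreteness, this is roughly the sequence I am running today on 1.7.2. It is only a sketch of what I do, not a documented procedure; the REST host, HDFS path, and ZooKeeper host are placeholders for our setup, and the zkCli invocation is illustrative:

# 1. Cancel with savepoint via the 1.7 REST API (target directory is a placeholder).
curl --header "Content-Type: application/json" --request POST \
  --data '{"target-directory":"hdfs://<namenode>:8020/tmp/savepoints","cancel-job":true}' \
  https://<jobmanager-rest>/jobs/00000000000000000000000000000000/savepoints

# 2. The entrypoint pod does not exit, so remove the k8s Job and the task manager Deployment by hand.
kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

# 3. Clear the HA state under our ZooKeeper chroot (the chroot is the one visible in the logs above),
#    otherwise --fromSavepoint is ignored in favor of the latest checkpoint.
zkCli.sh -server <zk-host>:2181 rmr /k8s_anomalyecho/k8s_anomalyecho

# 4. Redeploy with --fromSavepoint in the job-cluster args pointing at the savepoint path from step 1.
kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

If there is a shorter path than this, I would love to know.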