BTW, does 1.8 also solve the issue where we cancel with a save point? That too is broken in 1.7.2.
curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":true}' https://*************/jobs/00000000000000000000000000000000/savepoints

On Tue, Mar 12, 2019 at 11:55 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Awesome, thanks!

On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <g...@ververica.com> wrote:

The RC artifacts are only deployed to the Maven Central Repository when the RC is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you can find the maven artifacts and the Flink binaries here:

- https://repository.apache.org/content/repositories/orgapacheflink-1210/
- https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/

Alternatively, you can apply the patch yourself and build Flink 1.7 from sources [2]. On my machine this takes around 10 minutes if tests are skipped.

Best,
Gary

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink
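If anyone wants the build-from-source route in [2], a rough sketch; the commit to pick up is whatever the FLINK-10743 fix mentioned further down landed as, so treat it as a placeholder:

git clone https://github.com/apache/flink.git
cd flink
git checkout release-1.7                   # the 1.7 line
git cherry-pick <commit-for-FLINK-10743>   # placeholder for the fix referenced below
mvn clean install -DskipTests              # roughly 10 minutes without tests
# The resulting distribution typically ends up under flink-dist/target/flink-*-bin/.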
On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Do you have a mvn repository ( at mvn central ) set up for the 1.8 release candidate? We could test it for you.

Without 1.8 and this exit code we are essentially held up.

On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote:

Nobody can tell with 100% certainty. We want to give the RC some exposure first, and there is also a release process that is prescribed by the ASF [1]. You can look at past releases to get a feeling for how long the release process lasts [2].

[1] http://www.apache.org/legal/release-policy.html#release-approval
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0

On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

And when is the 1.8.0 release expected ?

On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

:) That makes so much more sense. Is k8s native flink a part of this release ?

On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:

Hi Vishal,

This issue was fixed recently [1], and the patch will be released with 1.8. If the Flink job gets cancelled, the JVM should exit with code 0. There is a release candidate [2], which you can test.

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10743
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html

On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Thanks Vijay,

This is the larger issue. The cancellation routine is itself broken.

On cancellation flink does remove the checkpoint counter

2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

but exits with a non-zero code

2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.

That I think is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.

On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes Vishal. Thats correct.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

This is really not cool, but here you go. This seems to work. Agreed that it cannot be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do?

- Take a save point. This returns a request id

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints

- Make sure the save point succeeded

curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687

- Cancel the job

curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel

- Delete the job and deployment

kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Edit the job-cluster-job-deployment.yaml. Add/Edit

args: ["job-cluster",
       "--fromSavepoint",
       "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
       "--job-classname", .........

- Restart

kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

- Make sure from the UI that it restored from the specific save point.
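For reference, that sequence can be strung together as one script. This is only a sketch mirroring the steps above: the REST endpoint, HDFS path, and manifest names are placeholders from this thread, it assumes jq is available, and it leaves the --fromSavepoint edit of the manifest as a manual step.

#!/usr/bin/env bash
set -euo pipefail

FLINK_REST="https://<flink-rest-endpoint>"    # placeholder for the redacted host
JOB_ID="00000000000000000000000000000000"
TARGET_DIR="hdfs://nn-crunchy:8020/tmp/xyz14"

# 1. Trigger a savepoint without cancelling (cancel-job:true is what misbehaves on 1.7.2).
TRIGGER_ID=$(curl -s --header "Content-Type: application/json" --request POST \
  --data "{\"target-directory\":\"${TARGET_DIR}\",\"cancel-job\":false}" \
  "${FLINK_REST}/jobs/${JOB_ID}/savepoints" | jq -r '."request-id"')

# 2. Poll the trigger until it is COMPLETED, then grab the savepoint location.
until [ "$(curl -s "${FLINK_REST}/jobs/${JOB_ID}/savepoints/${TRIGGER_ID}" | jq -r '.status.id')" = "COMPLETED" ]; do
  sleep 5
done
SAVEPOINT=$(curl -s "${FLINK_REST}/jobs/${JOB_ID}/savepoints/${TRIGGER_ID}" | jq -r '.operation.location')

# 3. Cancel the job.
curl -s --request PATCH "${FLINK_REST}/jobs/${JOB_ID}?mode=cancel"

# 4. Tear down the job-cluster Job and the task-manager Deployment.
kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

# 5. Edit job-cluster-job-deployment.yaml so the entrypoint args include
#    "--fromSavepoint", "${SAVEPOINT}" (done out of band in this sketch),
#    then recreate both resources.
kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

echo "Resumed from ${SAVEPOINT}; verify in the Flink UI."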
On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Yes, it is supposed to work. But unfortunately it was not working. The Flink community needs to respond to this behavior.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Aah.

Let me try this out and will get back to you.

Though I would assume that save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation ( else why would that be an option ).

Thanks again.

On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

yarn-cancel is not meant only for a yarn cluster. It works for all clusters. It is the recommended command.

Use the following command to issue a save point:

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1", "cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

Then issue yarn-cancel. After that, follow the process to restore the save point.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

Hello Vijay,

Thank you for the reply. This though is a k8s deployment ( rather than yarn ), but maybe they follow the same lifecycle. I issue a save point with cancel as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight up

curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

I would assume that after taking the save point, the jvm should exit; after all, the k8s deployment is of kind: job, and if it is a job cluster then a cancellation should exit the jvm and hence the pod. It does seem to do some things right. It stops a bunch of stuff ( the JobMaster, the SlotPool, the zookeeper coordinator etc ). It also removes the checkpoint counter but does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do ( to me at least ).

Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavepoint, it refuses to honor the fromSavepoint. I have to delete the zk chroot for it to consider the save point.
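( Clearing that chroot can be done with the ZooKeeper CLI; this is only a sketch, with the host as a placeholder and the znode path mirroring the logs further down. The exact paths depend on high-availability.zookeeper.path.root and high-availability.cluster-id. )

# Wipe the HA znodes for this job cluster so the next deploy honors
# --fromSavepoint instead of recovering the latest checkpoint from ZooKeeper.
bin/zkCli.sh -server <zk-host>:2181 <<'EOF'
deleteall /k8s_anomalyecho
quit
EOF
# On ZooKeeper 3.4.x the command is "rmr /k8s_anomalyecho" instead of "deleteall".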
Thus the process of cancelling and resuming from a SP on a k8s job cluster deployment seems to be:

- cancel with save point as defined here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
- delete the job manager job and task manager deployments from k8s almost immediately.
- clear the ZK chroot for the 0000000...... job and maybe the checkpoints directory.
- resume with --fromSavepoint

Can somebody confirm that this is indeed the process?

Logs are attached.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.

2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).

2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.

2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.

2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.

2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down

2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.

2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper

2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.

2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.

2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.

2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..

2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.

2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca@akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.

2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
After a little bit:

Starting the job-cluster

used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`

Starting standalonejob as a console application on host anomalyecho-mmg6t.

..

..

Regards.

On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

Hi Vishal,

Save point with cancellation internally uses the /cancel REST API, which is not a stable API. It always exits with 404. The best way to issue it is:

a) First issue the save point REST API
b) Then issue the /yarn-cancel REST API ( as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E )
c) Then, when resuming your job, provide the save point path returned by (a) as an argument to the run-jar REST API

Above is the smoother way.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

There are some issues I see and would want to get some feedback on.

1. On cancellation with save point with a target directory, the k8s job does not exit ( it is not a deployment ). I would assume that on cancellation the jvm should exit, after cleanup etc, and thus the pod should too. That does not happen, and thus the job pod remains live. Is that expected ?

2. To resume from a save point it seems that I have to delete the job id ( 0000000000.... ) from ZooKeeper ( this is HA ), else it defaults to the latest checkpoint no matter what.

I am kind of curious as to what in 1.7.2 is the tested process of cancelling with a save point and resuming, and what is the cogent story around the job id ( defaults to 000000000000.. ). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still cannot provide a new job id.

Regards,

Vishal.
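( For context on issue 1: a standalone job cluster on k8s is typically submitted as kind: Job, so whether the pod is retried or marked complete depends on the container's exit code. A quick way to see what happened after a cancel; the pod name and label below are placeholders: )

# List the job-cluster pods (label is a placeholder for whatever the manifest sets).
kubectl get pods -l app=anomalyecho

# Exit code of the last terminated job-manager container; check
# .state.terminated.exitCode instead if the Job created a fresh pod rather
# than restarting the container. A non-zero code ( e.g. the 1444 logged above )
# is what makes the Job start the container again "after a little bit".
kubectl get pod <job-cluster-pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'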