Re: K8s job cluster and cancel and resume from a save point ?

Gary Yao Tue, 12 Mar 2019 07:38:37 -0700

Hi Vishal,

I'm afraid not, but there are open pull requests for that. You can track the
progress here:


    https://issues.apache.org/jira/browse/FLINK-9953

Best,
Gary

On Tue, Mar 12, 2019 at 3:32 PM Vishal Santoshi <[email protected]>
wrote:

> :) That makes so much more sense. Is k8s-native Flink a part of this
> release?
>
> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <[email protected]> wrote:
>
>> Hi Vishal,
>>
>> This issue was fixed recently [1], and the patch will be released with 1.8.
>> If the Flink job gets cancelled, the JVM should exit with code 0. There is a
>> release candidate [2], which you can test.
>>
>> Best,
>> Gary
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>
>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>> [email protected]> wrote:
>>
>>> Thanks Vijay,
>>>
>>> This is the larger issue.  The cancellation routine is itself broken.
>>>
>>> On cancellation, Flink does remove the checkpoint counter
>>>
>>> *2019-03-12 14:12:13,143
>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>> ZooKeeper *
>>>
>>> but exits with a non-zero code
>>>
>>> *2019-03-12 14:12:13,477
>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>> exit code 1444.*
>>>
>>>
>>> That, I think, is an issue. A cancelled job is a completed job, and thus the
>>> exit code should be 0 for k8s to mark it complete.
>>>
>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <[email protected]>
>>> wrote:
>>>
>>>> Yes, Vishal. That's correct.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>> [email protected]> wrote:
>>>>
>>>>> This is really not cool, but here you go. This seems to work. Agreed that
>>>>> this cannot be this painful. The cancel does not exit with an exit code of
>>>>> 0, and thus the job has to be deleted manually. Vijay, does this align with
>>>>> what you have had to do?
>>>>>
>>>>>
>>>>>    - Take a save point. This returns a request id.
>>>>>
>>>>>    curl --header "Content-Type: application/json" --request POST \
>>>>>      --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' \
>>>>>      https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>>
>>>>>
>>>>>    - Make sure the save point succeeded
>>>>>
>>>>>    curl --request GET \
>>>>>      https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>
>>>>>
>>>>>
>>>>>    - Cancel the job
>>>>>
>>>>>    curl --request PATCH \
>>>>>      "https://***************/jobs/00000000000000000000000000000000?mode=cancel"
>>>>>
>>>>>
>>>>>
>>>>>    - Delete the job and deployment
>>>>>
>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>
>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>
>>>>>
>>>>>
>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>
>>>>>    args: ["job-cluster",
>>>>>           "--fromSavepoint",
>>>>>           "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>           "--job-classname", .........
>>>>>
>>>>>
>>>>>
>>>>>    - Restart
>>>>>
>>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>
>>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>
>>>>>
>>>>>
>>>>>    - Make sure, from the UI, that it restored from the specific save
>>>>>    point.
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Yes, it's supposed to work, but unfortunately it was not working.
>>>>>> The Flink community needs to respond to this behavior.
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Aah.
>>>>>>> Let me try this out and I will get back to you.
>>>>>>> Though I would assume that save point with cancel is a single atomic
>>>>>>> step, rather than a save point *followed* by a cancellation (else
>>>>>>> why would that be an option).
>>>>>>> Thanks again.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> yarn-cancel is not meant only for YARN clusters. It works for all
>>>>>>>> clusters. It's the recommended command.
>>>>>>>>
>>>>>>>> Use the following command to trigger a save point:
>>>>>>>>  curl --header "Content-Type: application/json" --request POST \
>>>>>>>>    --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
>>>>>>>>    https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>
>>>>>>>> Then issue yarn-cancel.
>>>>>>>> After that, follow the process to restore from the save point.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bhaskar
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello Vijay,
>>>>>>>>>
>>>>>>>>> Thank you for the reply. This, though, is a k8s deployment (rather
>>>>>>>>> than YARN), but maybe they follow the same lifecycle.
>>>>>>>>> I issue a *save point with cancel* as documented here
>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>> a straight up
>>>>>>>>>  curl --header "Content-Type: application/json" --request POST \
>>>>>>>>>    --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>>>>>>>    https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>
>>>>>>>>> I would assume that after taking the save point, the JVM should
>>>>>>>>> exit; after all, the k8s deployment is of kind: Job, and if it is a job
>>>>>>>>> cluster then a cancellation should exit the JVM and hence the pod. It
>>>>>>>>> does seem to do some things right. It stops a bunch of stuff (the
>>>>>>>>> JobMaster, the SlotPool, the ZooKeeper coordinator, etc.). It also
>>>>>>>>> removes the checkpoint counter but does not exit the job. And after a
>>>>>>>>> little bit the job is restarted, which does not make sense and is
>>>>>>>>> absolutely not the right thing to do (to me at least).
>>>>>>>>>
>>>>>>>>> Further, if I delete the deployment and the job from k8s and
>>>>>>>>> restart the job and deployment with fromSavepoint, it refuses to honor
>>>>>>>>> the fromSavepoint. I have to delete the ZK chroot for it to consider the
>>>>>>>>> save point.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>>>>>> cluster deployment seems to be
>>>>>>>>>
>>>>>>>>>    - cancel with save point as defined here
>>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>    - delete the job manager job and task manager deployments from
>>>>>>>>>    k8s almost immediately.
>>>>>>>>>    - clear the ZK chroot for the 0000000...... job and maybe
>>>>>>>>>    the checkpoints directory.
>>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>>
>>>>>>>>> Can somebody confirm that this indeed is the process?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Logs are attached.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>> Savepoint stored in
>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from 
>>>>>>>>> state
>>>>>>>>> RUNNING to CANCELLING.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,227 INFO  
>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink 
>>>>>>>>> (1/1)
>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink 
>>>>>>>>> (1/1)
>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to 
>>>>>>>>> CANCELED.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from 
>>>>>>>>> state
>>>>>>>>> CANCELLING to CANCELED.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,276 INFO  
>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>> - Shutting down
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,323 INFO  
>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>> discarded.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>> - Removing
>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>> from ZooKeeper
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,437 INFO  
>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>>>>>> discarded.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>>>> Shutting down.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from 
>>>>>>>>> ZooKeeper
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            -
>>>>>>>>> Job 00000000000000000000000000000000 reached globally terminal state
>>>>>>>>> CANCELED.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>> Stopping the JobMaster for job
>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,468 INFO  
>>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>>>> Shutting down rest endpoint.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>>>> JobManager is shutting down..
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>> Suspending SlotPool.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>> Stopping SlotPool.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>
>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After a little bit
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Starting the job-cluster
>>>>>>>>>
>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>>>>>> `jobmanager.heap.size`
>>>>>>>>>
>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>> ..
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vishal
>>>>>>>>>>
>>>>>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>>>>>> which is not a stable API. It always exits with 404. The best way to
>>>>>>>>>> issue it is:
>>>>>>>>>>
>>>>>>>>>> a) First issue the save point REST API
>>>>>>>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%[email protected]%3E
>>>>>>>>>> )
>>>>>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>>>>>> by (a) as an argument to the run jar REST API
>>>>>>>>>> The above is the smoother way
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Bhaskar
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>>>>>
>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory, the
>>>>>>>>>>> k8s job does not exit (it is not a deployment). I would assume that
>>>>>>>>>>> on cancellation the JVM should exit, after cleanup etc., and thus the
>>>>>>>>>>> pod should too. That does not happen, and thus the job pod remains
>>>>>>>>>>> live. Is that expected?
>>>>>>>>>>>
>>>>>>>>>>> 2. To resume from a save point it seems that I have to delete the
>>>>>>>>>>> job id (0000000000....) from ZooKeeper (this is HA), else it defaults
>>>>>>>>>>> to the latest checkpoint no matter what.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested process
>>>>>>>>>>> of cancelling with a save point and resuming, and what is the cogent
>>>>>>>>>>> story around job id (defaults to 000000000000..). Note that --job-id
>>>>>>>>>>> does not work with 1.7.2, so even though that does not make sense, I
>>>>>>>>>>> still cannot provide a new job id.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Vishal.
>>>>>>>>>>>
>>>>>>>>>>>
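
For reference, the manual sequence described earlier in this thread (take a
save point, check it, cancel, delete the k8s objects, recreate them with
--fromSavepoint) could be sketched as a small shell helper. This is only a
rough sketch under assumptions: the FLINK_URL, TARGET_DIR and MANIFESTS
defaults, the DRY_RUN switch and all function names are placeholders that do
not appear in the thread, and by default it only prints the commands. A real
run would also need to read the request id from the first response and wait
for the save point to complete before cancelling.

```shell
#!/usr/bin/env bash
# Sketch of the cancel-and-resume sequence from this thread (Flink 1.7.x job cluster on k8s).
# Assumptions, not from the thread: FLINK_URL, TARGET_DIR, the helper names and DRY_RUN.
FLINK_URL="${FLINK_URL:-https://flink.example.invalid}"   # placeholder REST endpoint
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"      # job id as seen in the thread
TARGET_DIR="${TARGET_DIR:-hdfs://namenode.example.invalid:8020/tmp/savepoints}"
MANIFESTS="${MANIFESTS:-manifests/bf2-PRODUCTION}"

# JSON body for POST /jobs/:jobid/savepoints; $1 = target dir, $2 = true|false
savepoint_payload() {
  printf '{"target-directory":"%s","cancel-job":%s}' "$1" "$2"
}

savepoint_url()        { printf '%s/jobs/%s/savepoints' "$FLINK_URL" "$JOB_ID"; }
savepoint_status_url() { printf '%s/jobs/%s/savepoints/%s' "$FLINK_URL" "$JOB_ID" "$1"; }
cancel_url()           { printf '%s/jobs/%s?mode=cancel' "$FLINK_URL" "$JOB_ID"; }

# Print each command instead of running it unless DRY_RUN=0 is set explicitly.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

cancel_and_resume() {
  # $1 = request id returned by step 1 (a placeholder string if not given)
  local request_id="${1:-REQUEST_ID_FROM_STEP_1}"
  # 1. Trigger a save point without cancelling; the response carries a request id.
  run curl --header "Content-Type: application/json" --request POST \
      --data "$(savepoint_payload "$TARGET_DIR" false)" "$(savepoint_url)"
  # 2. Check that the save point completed before going any further.
  run curl --request GET "$(savepoint_status_url "$request_id")"
  # 3. Cancel the job.
  run curl --request PATCH "$(cancel_url)"
  # 4. Delete the job and the task manager deployment.
  run kubectl delete -f "$MANIFESTS/job-cluster-job-deployment.yaml"
  run kubectl delete -f "$MANIFESTS/task-manager-deployment.yaml"
  # 5. (Manual) add "--fromSavepoint" and the save point path to the job args, then recreate.
  run kubectl create -f "$MANIFESTS/job-cluster-job-deployment.yaml"
  run kubectl create -f "$MANIFESTS/task-manager-deployment.yaml"
}

cancel_and_resume "$@"
```

With the default DRY_RUN=1 the script only lists the seven commands in order,
which makes it usable as a checklist; the save point path for step 5 still has
to be copied by hand from the status response of step 2.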
