Nobody can tell with 100% certainty. We want to give the RC some exposure first, and there is also a release process that is prescribed by the ASF [1]. You can look at past releases to get a feeling for how long the release process lasts [2].
[1] http://www.apache.org/legal/release-policy.html#release-approval
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0

On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> And when is the 1.8.0 release expected ?
>
> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> :) That makes so much more sense. Is k8s native Flink a part of this release ?
>>
>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:
>>
>>> Hi Vishal,
>>>
>>> This issue was fixed recently [1], and the patch will be released with 1.8. If the Flink job gets cancelled, the JVM should exit with code 0. There is a release candidate [2], which you can test.
>>>
>>> Best,
>>> Gary
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>
>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> Thanks Vijay,
>>>>
>>>> This is the larger issue. The cancellation routine is itself broken.
>>>>
>>>> On cancellation Flink does remove the checkpoint counter
>>>>
>>>> *2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper*
>>>>
>>>> but exits with a non-zero code
>>>>
>>>> *2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.*
>>>>
>>>> That I think is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.
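The exit-code point above maps directly onto how Kubernetes judges a Job pod: code 0 means "Completed", anything else means the pod failed. A minimal sketch of that rule, assuming the 1444 from the log above (the helper function is hypothetical, not part of Flink or kubectl):

```shell
# Hypothetical helper: how Kubernetes would classify a Job pod given the
# container's exit code. k8s treats 0 as success; any other code as failure.
job_pod_status() {
  if [ "$1" -eq 0 ]; then
    echo "Completed"
  else
    echo "Error"
  fi
}

# A cancelled job exiting with 1444, as in the log above, is therefore
# marked failed even though the cancellation itself succeeded.
job_pod_status 0      # prints: Completed
job_pod_status 1444   # prints: Error
```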
>>>>
>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>>> Yes Vishal. That's correct.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> This is really not cool, but here you go. This seems to work. Agreed that this cannot be this painful. The cancel does not exit with an exit code of 0, and thus the job has to be deleted manually. Vijay, does this align with what you have had to do ?
>>>>>>
>>>>>> - Take a save point. This returns a request id
>>>>>>
>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>
>>>>>> - Make sure the save point succeeded
>>>>>>
>>>>>> curl --request GET https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>
>>>>>> - Cancel the job
>>>>>>
>>>>>> curl --request PATCH https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>
>>>>>> - Delete the job and deployment
>>>>>>
>>>>>> kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>> kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>
>>>>>> - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>
>>>>>> args: ["job-cluster", "--fromSavepoint", "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22", "--job-classname", .........
>>>>>>
>>>>>> - Restart
>>>>>>
>>>>>> kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>> kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>
>>>>>> - Make sure, from the UI, that it restored from the specific save point.
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, it's supposed to work. But unfortunately it was not working. The Flink community needs to respond to this behavior.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Aah.
>>>>>>>> Let me try this out and will get back to you.
>>>>>>>> Though I would assume that save point with cancel is a single atomic step, rather than a save point *followed* by a cancellation ( else why would that be an option ).
>>>>>>>> Thanks again.
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Vishal,
>>>>>>>>>
>>>>>>>>> yarn-cancel isn't meant only for YARN clusters. It works for all clusters. It's the recommended command.
>>>>>>>>>
>>>>>>>>> Use the following command to issue a save point:
>>>>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1", "cancel-job":false}' \ https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>
>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>> After that, follow the process to restore the save point.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Vijay,
>>>>>>>>>>
>>>>>>>>>> Thank you for the reply.
>>>>>>>>>> This though is a k8s deployment ( rather than YARN ), but maybe they follow the same lifecycle.
>>>>>>>>>> I issue a *save point with cancel* as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight up
>>>>>>>>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \ https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>
>>>>>>>>>> I would assume that after taking the save point the JVM should exit; after all, the k8s deployment is of kind: Job, and if it is a job cluster then a cancellation should exit the JVM and hence the pod. It does seem to do some things right. It stops a bunch of components ( the JobMaster, the SlotPool, the ZooKeeper coordinator etc ). It also removes the checkpoint counter but does not exit the job. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do ( to me at least ).
>>>>>>>>>>
>>>>>>>>>> Further, if I delete the deployment and the job from k8s and restart the job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I have to delete the ZK chroot for it to consider the save point.
>>>>>>>>>>
>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s job cluster deployment seems to be
>>>>>>>>>>
>>>>>>>>>> - cancel with save point as defined here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>> - delete the job manager job and task manager deployments from k8s almost immediately.
>>>>>>>>>> - clear the ZK chroot for the 0000000...... job and maybe the checkpoints directory.
>>>>>>>>>> - resumeFromCheckPoint
>>>>>>>>>>
>>>>>>>>>> Can somebody confirm that this indeed is the process ?
>>>>>>>>>>
>>>>>>>>>> Logs are attached.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>>>>>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>>>>>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>>>>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>>>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>
>>>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>>>>>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>
>>>>>>>>>> After a little bit
>>>>>>>>>>
>>>>>>>>>> Starting the job-cluster
>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>>>>>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>>>>>>>> ..
>>>>>>>>>> ..
>>>>>>>>>>
>>>>>>>>>> Regards.
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>
>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST API, which is not a stable API; it always exits with 404. The best way to issue it is:
>>>>>>>>>>>
>>>>>>>>>>> a) First issue the save point REST API
>>>>>>>>>>> b) Then issue the /yarn-cancel REST API ( as described in http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E )
>>>>>>>>>>> c) Then, when resuming your job, provide the save point path returned by (a) as an argument to the run-jar REST API
>>>>>>>>>>> The above is the smoother way.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Bhaskar
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There are some issues I see and would want to get some feedback
>>>>>>>>>>>>
>>>>>>>>>>>> 1. On cancellation with save point with a target directory, the k8s job does not exit ( it is not a deployment ). I would assume that on cancellation the JVM should exit, after cleanup etc, and thus the pod should too. That does not happen and thus the job pod remains live. Is that expected ?
>>>>>>>>>>>>
>>>>>>>>>>>> 2.
>>>>>>>>>>>> To resume from a save point, it seems that I have to delete the job id ( 0000000000.... ) from ZooKeeper ( this is HA ), else it defaults to the latest checkpoint no matter what.
>>>>>>>>>>>>
>>>>>>>>>>>> I am kind of curious as to what, in 1.7.2, is the tested process of cancelling with a save point and resuming, and what is the cogent story around the job id ( defaults to 000000000000.. ). Note that --job-id does not work with 1.7.2, so even though that does not make sense, I still can not provide a new job id.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Vishal.
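The savepoint/cancel/redeploy cycle that emerges from this thread can be collected into one hedged shell sketch. The REST endpoints and JSON field names ("request-id", "IN_PROGRESS", "location") follow the Flink 1.7 REST API; JM_URL, the savepoint directory, and the manifest paths are placeholders, and the JSON scraping via sed is a naive stand-in for a real JSON parser:

```shell
#!/bin/sh
# Sketch of the cycle discussed above: trigger a savepoint, wait for it,
# cancel the job, then redeploy the job cluster with --fromSavepoint.
JM_URL="${JM_URL:-https://jobmanager.example.com}"   # placeholder
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"

trigger_savepoint() {  # POST /jobs/:jobid/savepoints -> {"request-id":"..."}
  curl -s -H "Content-Type: application/json" -X POST \
    -d "{\"target-directory\":\"$1\",\"cancel-job\":false}" \
    "$JM_URL/jobs/$JOB_ID/savepoints"
}

wait_for_savepoint() {  # poll GET /jobs/:jobid/savepoints/:triggerid
  while :; do
    body=$(curl -s "$JM_URL/jobs/$JOB_ID/savepoints/$1")
    case "$body" in
      *IN_PROGRESS*) sleep 2 ;;
      *) # completed: print the savepoint location (naive JSON scrape)
         echo "$body" | sed -n 's/.*"location":"\([^"]*\)".*/\1/p'
         return ;;
    esac
  done
}

cancel_job() {  # PATCH /jobs/:jobid?mode=cancel
  curl -s -X PATCH "$JM_URL/jobs/$JOB_ID?mode=cancel"
}

redeploy_from_savepoint() {  # tear down, then recreate pointing at "$1"
  kubectl delete -f manifests/job-cluster-job-deployment.yaml
  kubectl delete -f manifests/task-manager-deployment.yaml
  # Edit the job-cluster manifest first, e.g.:
  #   args: ["job-cluster", "--fromSavepoint", "$1", "--job-classname", ...]
  kubectl create -f manifests/job-cluster-job-deployment.yaml
  kubectl create -f manifests/task-manager-deployment.yaml
}
```

As the thread notes for 1.7.2, the ZK chroot for the fixed job id may also need to be cleared between teardown and redeploy, or the restored job falls back to the latest checkpoint.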