This is really not cool, but here you go. This seems to work. Agreed that it
should not be this painful. The cancel does not exit with an exit code of 0, and
thus the job has to be deleted manually. Vijay, does this align with what you
have had to do?
- Take a save point. This returns a request id.
curl --header "Content-Type: application/json" --request POST \
--data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' \
https://*************/jobs/00000000000000000000000000000000/savepoints
- Make sure the save point succeeded, using the request id returned above.
curl --request GET \
https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
- Cancel the job. The URL is quoted so the shell does not treat the ? as a glob.
curl --request PATCH \
'https://***************/jobs/00000000000000000000000000000000?mode=cancel'
- Delete the job and deployment
kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
- Edit job-cluster-job-deployment.yaml. Add or edit the args so they include the save point path:
args: ["job-cluster",
"--fromSavepoint",
"hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
"--job-classname", .........
- Restart
kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
- Verify in the UI that the job restored from that specific save point.
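
The steps above could be strung together roughly as follows. This is only a sketch and has not been run against a cluster. FLINK_URL stands in for the masked host, MANIFESTS for the manifest directory, and json_field is a made-up helper rather than anything Flink or kubectl ships. The response fields it reads (request-id, status.id, operation.location) are what I understand the 1.7 REST API to return, so check them against a real response first.

```shell
#!/usr/bin/env bash
# Sketch only. FLINK_URL is a placeholder for the masked REST endpoint;
# json_field is a hypothetical helper, not part of Flink or kubectl.
set -euo pipefail

FLINK_URL="${FLINK_URL:-https://flink.example.invalid}"
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"
SAVEPOINT_DIR="${SAVEPOINT_DIR:-hdfs://nn-crunchy:8020/tmp/xyz14}"
MANIFESTS="${MANIFESTS:-manifests/bf2-PRODUCTION}"

# Extract a string-valued field from a one-line JSON response (no jq needed).
json_field() {  # usage: json_field <key> <json>
  printf '%s\n' "$2" |
    sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Trigger a save point without cancel, wait for it, and print its location.
take_savepoint() {
  local resp request_id
  resp=$(curl --silent --fail --header "Content-Type: application/json" \
    --request POST \
    --data "{\"target-directory\":\"${SAVEPOINT_DIR}\",\"cancel-job\":false}" \
    "${FLINK_URL}/jobs/${JOB_ID}/savepoints")
  request_id=$(json_field request-id "$resp")
  [ -n "$request_id" ] || { echo "no request id returned" >&2; return 1; }
  # Assumed response shape: {"status":{"id":"IN_PROGRESS"|"COMPLETED"},...}
  until [ "$(json_field id "$resp")" = "COMPLETED" ]; do
    sleep 2
    resp=$(curl --silent --fail --request GET \
      "${FLINK_URL}/jobs/${JOB_ID}/savepoints/${request_id}")
  done
  # Prints nothing if the completed operation carries no location (a failure).
  json_field location "$resp"
}

# Cancel the job, then delete the k8s objects by hand.
cancel_and_delete() {
  # The cancel call does not return success in this setup, so ignore its status.
  curl --silent --request PATCH "${FLINK_URL}/jobs/${JOB_ID}?mode=cancel" || true
  kubectl delete -f "${MANIFESTS}/job-cluster-job-deployment.yaml"
  kubectl delete -f "${MANIFESTS}/task-manager-deployment.yaml"
}

# Only touch a cluster when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
  location=$(take_savepoint)
  [ -n "$location" ] || { echo "save point failed" >&2; exit 1; }
  cancel_and_delete
  echo "Set --fromSavepoint in job-cluster-job-deployment.yaml to: $location"
fi
```

Running it with --run stops after the delete and prints the save point path. Editing the yaml and the two kubectl create calls are left as manual steps, since the args line differs per job.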
On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <[email protected]>
wrote:
> Yes, it's supposed to work, but unfortunately it was not working. The Flink
> community needs to respond to this behavior.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <[email protected]>
> wrote:
>
>> Aah.
>> Let me try this out and will get back to you.
>> Though I would assume that save point with cancel is a single atomic
>> step, rather than a save point *followed* by a cancellation (else why
>> would that be an option).
>> Thanks again.
>>
>>
>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <[email protected]>
>> wrote:
>>
>>> Hi Vishal,
>>>
>>> yarn-cancel is not meant only for YARN clusters. It works for all
>>> clusters. It is the recommended command.
>>>
>>> Use the following command to issue a save point:
>>> curl --header "Content-Type: application/json" --request POST --data \
>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>
>>> Then issue yarn-cancel.
>>> After that, follow the process to restore from the save point.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>> [email protected]> wrote:
>>>
>>>> Hello Vijay,
>>>>
>>>> Thank you for the reply. This, though, is a k8s deployment (rather
>>>> than YARN), but maybe they follow the same lifecycle. I issue a
>>>> *save point with cancel* as documented here
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>> with a straight up
>>>> curl --header "Content-Type: application/json" --request POST --data \
>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>> I would assume that after taking the save point, the JVM should exit;
>>>> after all, the k8s deployment is of kind: Job, and if it is a job cluster
>>>> then a cancellation should exit the JVM and hence the pod. It does seem to
>>>> do some things right. It stops a bunch of stuff (the JobMaster, the
>>>> SlotPool, the ZooKeeper coordinator, etc.). It also removes the checkpoint
>>>> counter but does not exit the job. And after a little bit the job is
>>>> restarted, which does not make sense and is absolutely not the right thing
>>>> to do (to me at least).
>>>>
>>>> Further, if I delete the deployment and the job from k8s and restart the
>>>> job and deployment with --fromSavepoint, it refuses to honor the save point. I
>>>> have to delete the ZK chroot for it to consider the save point.
>>>>
>>>>
>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>> cluster deployment seems to be
>>>>
>>>> - cancel with save point as defined here
>>>>
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>> - delete the job manager job and task manager deployments from k8s
>>>> almost immediately.
>>>> - clear the ZK chroot for the 0000000...... job and maybe the
>>>> checkpoints directory.
>>>> - resume from the save point (fromSavepoint)
>>>>
>>>> Can somebody confirm that this is indeed the process?
>>>>
>>>>
>>>>
>>>> Logs are attached.
>>>>
>>>>
>>>>
>>>> 2019-03-12 08:10:43,871 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster -
>>>> Savepoint stored in
>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>> cancelling 00000000000000000000000000000000.
>>>>
>>>> 2019-03-12 08:10:43,871 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
>>>> anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
>>>> to CANCELLING.
>>>>
>>>> 2019-03-12 08:10:44,227 INFO
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>> - Completed checkpoint 10 for job 00000000000000000000000000000000
>>>> (7238 bytes in 311 ms).
>>>>
>>>> 2019-03-12 08:10:44,232 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>
>>>> 2019-03-12 08:10:44,274 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph -
>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>
>>>> 2019-03-12 08:10:44,276 INFO
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
>>>> anomaly_echo (00000000000000000000000000000000) switched from state
>>>> CANCELLING to CANCELED.
>>>>
>>>> 2019-03-12 08:10:44,276 INFO
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>> - Stopping checkpoint coordinator for job
>>>> 00000000000000000000000000000000.
>>>>
>>>> 2019-03-12 08:10:44,277 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
>>>> Shutting down
>>>>
>>>> 2019-03-12 08:10:44,323 INFO
>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>> - Checkpoint with ID 8 at
>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>> discarded.
>>>>
>>>> 2019-03-12 08:10:44,437 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
>>>> Removing
>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>> from ZooKeeper
>>>>
>>>> 2019-03-12 08:10:44,437 INFO
>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>> - Checkpoint with ID 10 at
>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>> discarded.
>>>>
>>>> 2019-03-12 08:10:44,447 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter -
>>>> Shutting down.
>>>>
>>>> 2019-03-12 08:10:44,447 INFO
>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter -
>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>> ZooKeeper
>>>>
>>>> 2019-03-12 08:10:44,463 INFO
>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher - Job
>>>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>
>>>> 2019-03-12 08:10:44,467 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster -
>>>> Stopping the JobMaster for job
>>>> anomaly_echo(00000000000000000000000000000000).
>>>>
>>>> 2019-03-12 08:10:44,468 INFO
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>> - Shutting StandaloneJobClusterEntryPoint down with
>>>> application status CANCELED. Diagnostics null.
>>>>
>>>> 2019-03-12 08:10:44,468 INFO
>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint -
>>>> Shutting down rest endpoint.
>>>>
>>>> 2019-03-12 08:10:44,473 INFO
>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>> /leader/resource_manager_lock.
>>>>
>>>> 2019-03-12 08:10:44,475 INFO
>>>> org.apache.flink.runtime.jobmaster.JobMaster - Close
>>>> ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is
>>>> shutting down..
>>>>
>>>> 2019-03-12 08:10:44,475 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool -
>>>> Suspending SlotPool.
>>>>
>>>> 2019-03-12 08:10:44,476 INFO
>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool -
>>>> Stopping SlotPool.
>>>>
>>>> 2019-03-12 08:10:44,476 INFO
>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
>>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>> 00000000000000000000000000000000 from the resource manager.
>>>>
>>>> 2019-03-12 08:10:44,477 INFO
>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>> - Stopping ZooKeeperLeaderElectionService
>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>
>>>>
>>>> After a little bit
>>>>
>>>>
>>>> Starting the job-cluster
>>>>
>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>> `jobmanager.heap.size`
>>>>
>>>> Starting standalonejob as a console application on host
>>>> anomalyecho-mmg6t.
>>>>
>>>> ..
>>>>
>>>> ..
>>>>
>>>>
>>>> Regards.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Vishal
>>>>>
>>>>> Save point with cancellation internally uses the /cancel REST API, which
>>>>> is not a stable API. It always exits with 404. The best way to issue it is:
>>>>>
>>>>> a) First issue the save point REST API call
>>>>> b) Then issue the /yarn-cancel REST API call (as described in
>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%[email protected]%3E
>>>>> )
>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>> by (a) as an argument to the run jar REST API
>>>>> The above is the smoother way.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> There are some issues I see and would like some feedback on.
>>>>>>
>>>>>> 1. On cancellation with save point with a target directory, the k8s
>>>>>> job does not exit (it is not a deployment). I would assume that on
>>>>>> cancellation the JVM should exit, after cleanup etc., and thus the pod
>>>>>> should too. That does not happen and thus the job pod remains live. Is
>>>>>> that expected?
>>>>>>
>>>>>> 2. To resume from a save point it seems that I have to delete the job
>>>>>> id (0000000000....) from ZooKeeper (this is HA), else it defaults to
>>>>>> the latest checkpoint no matter what.
>>>>>>
>>>>>>
>>>>>> I am curious what the tested process is in 1.7.2 for cancelling
>>>>>> with a save point and resuming, and what the cogent story is
>>>>>> around the job id (defaults to 000000000000..). Note that --job-id does not
>>>>>> work with 1.7.2, so even though that does not make sense, I still cannot
>>>>>> provide a new job id.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Vishal.
>>>>>>
>>>>>>