Hi Vishal,
yarn-cancel is not meant only for YARN clusters. It works for all clusters.
It is the recommended command.
Use the following command to issue a save point:
curl --header "Content-Type: application/json" --request POST \
  --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
  https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
Then issue yarn-cancel.
After that, follow the usual process to restore from the save point.
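
The two calls above can be sketched as small shell helpers. This is only a sketch under assumptions: FLINK_REST is a placeholder host (not the real ingress), the function names are made up for illustration, and I am assuming yarn-cancel is exposed as a GET on /jobs/<jobid>/yarn-cancel, so please check that against the REST docs for your version.

```shell
#!/bin/sh
# Sketch only. FLINK_REST and JOB_ID are placeholders (assumptions),
# not the real ingress or job from this thread.
FLINK_REST="${FLINK_REST:-https://flink.example.invalid}"
JOB_ID="${JOB_ID:-00000000000000000000000000000000}"

# Build the JSON body for the savepoint trigger request.
# $1 = target directory, $2 = true|false for "cancel-job"
savepoint_payload() {
  printf '{"target-directory":"%s","cancel-job":%s}' "$1" "$2"
}

# URL of the savepoint trigger endpoint for the job.
savepoint_url() {
  printf '%s/jobs/%s/savepoints' "$FLINK_REST" "$JOB_ID"
}

# URL of the yarn-cancel endpoint for the job (assumed to be a GET).
cancel_url() {
  printf '%s/jobs/%s/yarn-cancel' "$FLINK_REST" "$JOB_ID"
}

# Step 1: trigger a savepoint WITHOUT cancelling the job.
# $1 = target directory
trigger_savepoint() {
  curl --silent --header "Content-Type: application/json" --request POST \
    --data "$(savepoint_payload "$1" false)" "$(savepoint_url)"
}

# Step 2: cancel the job once the savepoint has completed.
cancel_job() {
  curl --silent --request GET "$(cancel_url)"
}
```

Usage would be something like `trigger_savepoint hdfs://<namenode>:8020/tmp/xyz1`, then `cancel_job` once the savepoint has been written. The trigger call is asynchronous and returns a request id, so it is safer to confirm the savepoint has completed (for example by checking the target directory or the job manager log) before cancelling.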
Regards
Bhaskar
On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <[email protected]>
wrote:
> Hello Vijay,
>
> Thank you for the reply. This, though, is a k8s deployment (rather than
> YARN), but maybe they follow the same lifecycle. I issue a *save point
> with cancel* as documented here
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
> a straight up
> curl --header "Content-Type: application/json" --request POST \
>   --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>   https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>
> I would assume that after taking the save point the JVM should exit. After
> all, the k8s deployment is of kind: job, and if it is a job cluster then a
> cancellation should exit the JVM and hence the pod. It does seem to do some
> things right: it stops a bunch of stuff (the JobMaster, the SlotPool, the
> ZooKeeper coordinator, etc.) and removes the checkpoint counter, but it does
> not exit the job. After a little while the job is restarted, which does not
> make sense and is absolutely not the right thing to do (to me at least).
>
> Further if I delete the deployment and the job from k8s and restart the
> job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
> have to delete the zk chroot for it to consider the save point.
>
>
> Thus the process of cancelling and resuming from a SP on a k8s job cluster
> deployment seems to be
>
> - cancel with save point as defined here
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
> - delete the job manager job and task manager deployments from k8s
> almost immediately.
> - clear the ZK chroot for the 0000000...... job and maybe the
> checkpoints directory.
> - resumeFromCheckPoint
>
> Can somebody confirm that this is indeed the process?
>
>
>
> Logs are attached.
>
>
>
> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster
> - Savepoint stored in
> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
> cancelling 00000000000000000000000000000000.
>
> 2019-03-12 08:10:43,871 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
> anomaly_echo (00000000000000000000000000000000) switched from state RUNNING
> to CANCELLING.
>
> 2019-03-12 08:10:44,227 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> - Completed checkpoint 10 for job 00000000000000000000000000000000
> (7238 bytes in 311 ms).
>
> 2019-03-12 08:10:44,232 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Source:
> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>
> 2019-03-12 08:10:44,274 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Source:
> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>
> 2019-03-12 08:10:44,276 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
> anomaly_echo (00000000000000000000000000000000) switched from state
> CANCELLING to CANCELED.
>
> 2019-03-12 08:10:44,276 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> - Stopping checkpoint coordinator for job
> 00000000000000000000000000000000.
>
> 2019-03-12 08:10:44,277 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
> Shutting down
>
> 2019-03-12 08:10:44,323 INFO
> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
> - Checkpoint with ID 8 at
> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
> discarded.
>
> 2019-03-12 08:10:44,437 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
> Removing
> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
> from ZooKeeper
>
> 2019-03-12 08:10:44,437 INFO
> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
> - Checkpoint with ID 10 at
> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
> discarded.
>
> 2019-03-12 08:10:44,447 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter -
> Shutting down.
>
> 2019-03-12 08:10:44,447 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter -
> Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>
> 2019-03-12 08:10:44,463 INFO
> org.apache.flink.runtime.dispatcher.MiniDispatcher - Job
> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>
> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster
> - Stopping the JobMaster for job
> anomaly_echo(00000000000000000000000000000000).
>
> 2019-03-12 08:10:44,468 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
> - Shutting StandaloneJobClusterEntryPoint down with application
> status CANCELED. Diagnostics null.
>
> 2019-03-12 08:10:44,468 INFO
> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting
> down rest endpoint.
>
> 2019-03-12 08:10:44,473 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>
> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster
> - Close ResourceManager connection
> d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>
> 2019-03-12 08:10:44,475 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool -
> Suspending SlotPool.
>
> 2019-03-12 08:10:44,476 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping
> SlotPool.
>
> 2019-03-12 08:10:44,476 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager -
> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
> 00000000000000000000000000000000 from the resource manager.
>
> 2019-03-12 08:10:44,477 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService -
> Stopping ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>
>
> After a little bit
>
>
> Starting the job-cluster
>
> used deprecated key `jobmanager.heap.mb`, please replace with key
> `jobmanager.heap.size`
>
> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>
> ..
>
> ..
>
>
> Regards.
>
>
>
>
>
> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <[email protected]>
> wrote:
>
>> Hi Vishal
>>
>> Save point with cancellation internally uses the /cancel REST API, which
>> is not a stable API. It always exits with 404. The best way to do it is:
>>
>> a) First issue the save point REST API
>> b) Then issue the /yarn-cancel REST API (as described in
>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%[email protected]%3E
>> )
>> c) Then, when resuming your job, provide the save point path returned by
>> (a) as an argument to the run jar REST API
>> The above is the smoother way.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>> [email protected]> wrote:
>>
>>> There are some issues I see and would like to get some feedback on:
>>>
>>> 1. On Cancellation With SavePoint with a Target Directory , the k8s
>>> job does not exit ( it is not a deployment ) . I would assume that on
>>> cancellation the jvm should exit, after cleanup etc, and thus the pod
>>> should too. That does not happen and thus the job pod remains live. Is that
>>> expected ?
>>>
>>> 2. To resume from a save point it seems that I have to delete the job id
>>> ( 0000000000.... ) from ZooKeeper ( this is HA ), else it defaults to the
>>> latest checkpoint no matter what.
>>>
>>>
>>> I am kind of curious as to what in 1.7.2 is the tested process of
>>> cancelling with a save point and resuming and what is the cogent story
>>> around job id ( defaults to 000000000000.. ). Note that --job-id does not
>>> work with 1.7.2 so even though that does not make sense, I still can not
>>> provide a new job id.
>>>
>>> Regards,
>>>
>>> Vishal.
>>>
>>>