Hello Vijay,

Thank you for the reply. This though is a k8s deployment (rather than YARN), but maybe they follow the same lifecycle. I issue a *savepoint with cancel* as documented here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, a straight-up curl:

curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
  https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
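As an aside, the POST above is asynchronous: per the same REST API page it returns a request-id, which can then be polled for the savepoint status. A minimal sketch against my (masked) ingress; the request-id value below is just an illustration:

# the POST replies with e.g. {"request-id":"d2d1f16b3a903b47e2d5a871ac2e0fbc"}
# poll until status.id is COMPLETED; the response then carries the savepoint location
curl --request GET \
  https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints/d2d1f16b3a903b47e2d5a871ac2e0fbc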
I would assume that after taking the savepoint the JVM should exit; after all, the k8s deployment is of kind: Job, and if it is a job cluster then a cancellation should exit the JVM and hence the pod. It does seem to do some things right: it stops a bunch of stuff (the JobMaster, the SlotPool, the ZooKeeper coordinator, etc.) and removes the checkpoint counter, but it does not exit. And after a little bit the job is restarted, which does not make sense and is absolutely not the right thing to do (to me at least).

Further, if I delete the deployment and the job from k8s and restart them with fromSavepoint, it refuses to honor the fromSavepoint; I have to delete the ZK chroot for it to consider the savepoint. Thus the process of cancelling and resuming from a savepoint on a k8s job cluster deployment seems to be (a sketch of the full sequence is at the bottom of this mail):

- cancel with savepoint as defined here https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
- delete the job manager job and the task manager deployment from k8s almost immediately
- clear the ZK chroot for the 0000000...... job, and maybe the checkpoints directory
- resume with fromSavepoint

Can somebody confirm that this is indeed the process? Logs below:

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.

And after a little bit the job cluster starts right back up:

Starting the job-cluster used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
Starting standalonejob as a console application on host anomalyecho-mmg6t.
..
..

Regards.

On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

> Hi Vishal
>
> Savepoint with cancellation internally uses the /cancel REST API, which is
> not a stable API. It always exits with 404. The best way to issue it is:
>
> a) First issue the savepoint REST API
> b) Then issue the /yarn-cancel REST API (as described in
> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E
> )
> c) Then, when resuming your job, provide the savepoint path returned by
> (a) as an argument to the run-jar REST API
> Above is the smoother way
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com>
> wrote:
>
>> There are some issues I see and would want to get some feedback on.
>>
>> 1. On Cancellation With SavePoint with a Target Directory, the k8s job
>> does not exit (it is not a deployment). I would assume that on
>> cancellation the JVM should exit, after cleanup etc., and thus the pod
>> should too. That does not happen and thus the job pod remains live. Is
>> that expected?
>>
>> 2. To resume from a savepoint it seems that I have to delete the job id (
>> 0000000000....
>> ) from ZooKeeper (this is HA), else it defaults to the
>> latest checkpoint no matter what.
>>
>> I am kind of curious as to what in 1.7.2 is the tested process of
>> cancelling with a savepoint and resuming, and what the cogent story is
>> around the job id (defaults to 000000000000..). Note that --job-id does
>> not work with 1.7.2, so even though that does not make sense, I still
>> can not provide a new job id.
>>
>> Regards,
>>
>> Vishal.
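P.S. For the archives, here is the sketch of the cancel-and-resume sequence I described above. The kubectl resource names, the ZK connection string, and the job class are placeholders from my setup, not anything Flink prescribes; the ZK chroot matches the /k8s_anomalyecho/k8s_anomalyecho path visible in the logs:

# 1) savepoint with cancel (same curl as at the top of this mail)
curl --header "Content-Type: application/json" --request POST \
  --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
  https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints

# 2) delete the job manager job and the task manager deployment almost
#    immediately, before the job cluster restarts the job
kubectl delete job anomalyecho
kubectl delete deployment anomalyecho-taskmanager

# 3) clear the ZK chroot for job 00000000000000000000000000000000
#    (rmr on ZK 3.4 zkCli; deleteall on newer versions)
zkCli.sh -server zk-host:2181 rmr /k8s_anomalyecho/k8s_anomalyecho

# 4) redeploy the job cluster, passing the savepoint path from step 1 to the
#    entrypoint args in the job manager manifest, e.g.
#    --job-classname <your.job.MainClass> --fromSavepoint hdfs://*********:8020/tmp/xyz1/savepoint-000000-xxxxxx
kubectl apply -f anomalyecho-jobmanager.yaml
kubectl apply -f anomalyecho-taskmanager.yaml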