Re: K8s job cluster and cancel and resume from a save point ?

Vijay Bhaskar Tue, 12 Mar 2019 00:26:00 -0700

Hi Vishal

Savepoint with cancellation internally uses the /cancel REST API, which is
not a stable API; it always returns a 404. The best way to do this is:


a) First, trigger a savepoint through the savepoint REST API.
b) Then call the /yarn-cancel REST API (as described in
http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E
)
c) Then, when resuming your job, pass the savepoint path returned by (a) as
an argument to the run-jar REST API.

The above is the smoother way.
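
For illustration, steps (a) to (c) could be scripted roughly as below. This is
only a sketch under assumptions, not a tested procedure: the endpoint paths
(/jobs/<id>/savepoints, /jobs/<id>/yarn-cancel, /jars/<id>/run) and the JSON
field names ("target-directory", "cancel-job", "request-id", "location",
"savepointPath") are what I understand the Flink REST API to use, so please
check them against the REST docs for your version. The function names, the
base URL, and the jar id are hypothetical.

```python
import json
import time
import urllib.request


def http_json(method, url, body=None):
    """Send a JSON request and return the decoded JSON reply ({} if empty)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else {}


def savepoint_then_cancel(base_url, job_id, target_dir,
                          call=http_json, poll_seconds=2.0, max_polls=60):
    """Steps (a) and (b): take a savepoint, wait for it, then cancel the job."""
    # (a) Trigger the savepoint WITHOUT cancelling (assumed field names).
    trigger = call("POST", f"{base_url}/jobs/{job_id}/savepoints",
                   {"target-directory": target_dir, "cancel-job": False})
    request_id = trigger["request-id"]

    # The trigger is asynchronous, so poll until the operation completes.
    location = None
    for _ in range(max_polls):
        reply = call("GET",
                     f"{base_url}/jobs/{job_id}/savepoints/{request_id}")
        if reply["status"]["id"] == "COMPLETED":
            operation = reply.get("operation", {})
            if "failure-cause" in operation:
                raise RuntimeError(f"savepoint failed: {operation}")
            location = operation["location"]
            break
        time.sleep(poll_seconds)
    if location is None:
        raise TimeoutError("savepoint did not complete in time")

    # (b) Cancel only after the savepoint path is known.
    call("GET", f"{base_url}/jobs/{job_id}/yarn-cancel")
    return location


def resume_from_savepoint(base_url, jar_id, savepoint_path, call=http_json):
    """Step (c): submit the jar again, restoring from the savepoint of (a)."""
    return call("POST", f"{base_url}/jars/{jar_id}/run",
                {"savepointPath": savepoint_path})
```

Usage would be something like
path = savepoint_then_cancel("http://jobmanager:8081", job_id, "s3://bucket/savepoints")
followed by resume_from_savepoint("http://jobmanager:8081", jar_id, path).
Taking the savepoint and cancelling as two separate calls means the job is
only cancelled once the savepoint location has actually been returned.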

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com>
wrote:

> There are some issues I see and would want to get some feedback
>
> 1. On cancellation with savepoint with a target directory, the k8s job
> does not exit (it is not a deployment). I would assume that on
> cancellation the JVM should exit, after cleanup etc., and thus the pod
> should too. That does not happen and thus the job pod remains live. Is that
> expected?
>
> 2. To resume from a savepoint it seems that I have to delete the job id
> (0000000000....) from ZooKeeper (this is HA), else it defaults to the
> latest checkpoint no matter what.
>
>
> I am kind of curious as to what the tested process is in 1.7.2 for
> cancelling with a savepoint and resuming, and what the cogent story is
> around the job id (defaults to 000000000000..). Note that --job-id does not
> work with 1.7.2, so even though that does not make sense, I still cannot
> provide a new job id.
>
> Regards,
>
> Vishal.
>
>
