Hi Derek,

What I would recommend is to trigger the cancel-with-savepoint command
[1]. This will create a savepoint and then terminate the job execution.
Next you simply need to respawn the job cluster, providing it with the
savepoint to resume from.

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
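
For example, the whole cycle could look roughly like this (a sketch only:
the job ID, savepoint directory, and Kubernetes resource names are
placeholders, and the --fromSavepoint argument assumes your image uses the
standalone job cluster entry point; adjust to however your image wires its
arguments):

    # 1. Trigger a savepoint and cancel the running job
    bin/flink cancel -s s3://my-bucket/savepoints <jobId>

    # 2. Tear down the old job cluster deployment
    kubectl delete deployment my-job-cluster

    # 3. Respawn it with the savepoint path in the container args, e.g.
    #    job-cluster --job-classname com.example.MyJob \
    #                --fromSavepoint s3://my-bucket/savepoints/savepoint-xyz
    kubectl apply -f my-job-cluster.yaml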

Cheers,
Till

On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <and...@data-artisans.com>
wrote:

> Hi Derek,
>
> I think your automation steps look good.
> Recreating deployments should not take long, and, as you mention, this
> way you can avoid unpredictable old/new version collisions.
>
> Best,
> Andrey
>
> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org>
> wrote:
> >
> > Hi Derek,
> >
> > I am not an expert in Kubernetes, so I will cc Till, who should be able
> > to help you more.
> >
> > As for automating a similar process, I would recommend having a look
> > at dA Platform [1], which is built on top of Kubernetes.
> >
> > Best,
> >
> > Dawid
> >
> > [1] https://data-artisans.com/platform-overview
> >
> > On 30/11/2018 02:10, Derek VerLee wrote:
> >>
> >> I'm looking at the job cluster mode. It looks great, and I am
> >> considering migrating our jobs off our "legacy" session cluster and
> >> into Kubernetes.
> >>
> >> I do need to ask some questions because I haven't found a lot of
> >> details in the documentation about how it works yet, and I gave up
> >> following the DI around in the code after a while.
> >>
> >> Let's say I have a deployment for the job "leader" in HA with ZK, and
> >> another deployment for the taskmanagers.
> >>
> >> I want to upgrade the code or configuration and start from a
> >> savepoint, in an automated way.
> >>
> >> Best I can figure, I cannot just update the deployment resources in
> >> Kubernetes and allow the containers to restart in an arbitrary order.
> >>
> >> Instead, I expect sequencing is important, something along the lines
> >> of this:
> >>
> >> 1. issue savepoint command on leader
> >> 2. wait for savepoint
> >> 3. destroy all leader and taskmanager containers
> >> 4. deploy new leader, with savepoint url
> >> 5. deploy new taskmanagers
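> >>
> >> Concretely, something like this (deployment names, paths, and the job
> >> ID are just placeholders):
> >>
> >>     # steps 1-2: savepoint + cancel on the leader
> >>     flink cancel -s s3://savepoints/my-job <jobId>
> >>     # step 3: tear everything down
> >>     kubectl delete deployment flink-leader flink-taskmanager
> >>     # step 4: new leader, templated with the savepoint path from above
> >>     kubectl apply -f leader-deployment.yaml
> >>     # step 5: new taskmanagers
> >>     kubectl apply -f taskmanager-deployment.yaml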
> >>
> >>
> >> For example, I imagine old taskmanagers (with an old version of my
> >> job) attaching to the new leader and causing a problem.
> >>
> >> Does that sound right, or am I overthinking it?
> >>
> >> If not, has anyone tried implementing any automation for this yet?
> >>
> >
>
>
