Hi Derek,

What I would recommend is to trigger the cancel with savepoint command [1]. This will create a savepoint and then terminate the job execution. Next, you simply respawn the job cluster, providing it with the savepoint to resume from.
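In its simplest form that is just two calls. A minimal sketch, assuming an S3 savepoint directory; the job ID, paths and class name are placeholders, and the exact resume flag depends on how your job cluster image is started (the flink-container entrypoint accepts --fromSavepoint):

  # take a savepoint and cancel the job in one step; the CLI blocks
  # until the savepoint is complete and prints its path
  bin/flink cancel -s s3://my-bucket/savepoints <jobId>

  # respawn the job cluster with that path in the entrypoint args so it
  # resumes from the savepoint (names here are hypothetical)
  job-cluster --job-classname com.example.MyStreamingJob \
    --fromSavepoint s3://my-bucket/savepoints/savepoint-abc123

A fuller end-to-end sketch of the redeploy sequencing you outlined is at the bottom of this mail.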
[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint

Cheers,
Till

On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <and...@data-artisans.com> wrote:
> Hi Derek,
>
> I think your automation steps look good. Recreating deployments should
> not take long and, as you mention, this way you can avoid unpredictable
> old/new version collisions.
>
> Best,
> Andrey
>
> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> >
> > Hi Derek,
> >
> > I am not an expert in Kubernetes, so I will cc Till, who should be
> > able to help you more.
> >
> > As for automating a similar process, I would recommend having a look
> > at the dA platform [1], which is built on top of Kubernetes.
> >
> > Best,
> >
> > Dawid
> >
> > [1] https://data-artisans.com/platform-overview
> >
> > On 30/11/2018 02:10, Derek VerLee wrote:
> >> I'm looking at the job cluster mode; it looks great and I am
> >> considering migrating our jobs off our "legacy" session cluster and
> >> into Kubernetes.
> >>
> >> I do need to ask some questions, because I haven't found many
> >> details in the documentation about how it works yet, and I gave up
> >> following the DI around in the code after a while.
> >>
> >> Let's say I have a deployment for the job "leader" in HA with ZK, and
> >> another deployment for the taskmanagers.
> >>
> >> I want to upgrade the code or configuration and start from a
> >> savepoint, in an automated way.
> >>
> >> As best I can figure, I cannot just update the deployment resources
> >> in Kubernetes and allow the containers to restart in an arbitrary
> >> order.
> >>
> >> Instead, I expect sequencing is important, something along the lines
> >> of this:
> >>
> >> 1. issue savepoint command on leader
> >> 2. wait for savepoint
> >> 3. destroy all leader and taskmanager containers
> >> 4. deploy new leader, with savepoint url
> >> 5. deploy new taskmanagers
> >>
> >> For example, I imagine old taskmanagers (with an old version of my
> >> job) attaching to the new leader and causing a problem.
> >>
> >> Does that sound right, or am I overthinking it?
> >>
> >> If not, has anyone tried implementing any automation for this yet?
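For completeness, here is a rough sketch of the full sequence you outlined, driven by kubectl and the Flink CLI. The deployment names, label, manifest files and savepoint directory are placeholders (not from this thread), and the CLI needs to be able to reach the jobmanager, e.g. through a Kubernetes service:

  #!/usr/bin/env bash
  set -euo pipefail

  JOB_ID="$1"
  SAVEPOINT_DIR="s3://my-bucket/savepoints"

  # steps 1+2: take a savepoint and cancel the job; the CLI blocks until
  # the savepoint has completed and prints its final path
  SAVEPOINT_PATH=$(bin/flink cancel -s "$SAVEPOINT_DIR" "$JOB_ID" \
      | grep -o "${SAVEPOINT_DIR}/savepoint-[a-z0-9-]*")

  # step 3: tear down leader and taskmanagers completely, so that no
  # old-version taskmanager can attach to the new leader
  kubectl delete deployment flink-job-cluster flink-taskmanager
  kubectl wait --for=delete pod -l app=flink --timeout=120s

  # step 4: deploy the new job cluster, templating the savepoint path
  # into the container args (e.g. as --fromSavepoint)
  sed "s|{{SAVEPOINT_PATH}}|${SAVEPOINT_PATH}|" job-cluster.yaml.tmpl \
      | kubectl apply -f -

  # step 5: deploy the new taskmanagers
  kubectl apply -f taskmanager-deployment.yaml

This keeps the hard cut between old and new versions that Andrey mentioned: nothing of the old job survives step 3, so there is no window for version collisions.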