Thanks all:
1) For now, we will try with in-house Kubernetes and see how it goes.
2) Till, cheers, I'll give it a stab, though I'll likely end up with an
operator or some other workflow tool (I've gotten multiple weird looks
when I mentioned the init container approach at work; at this point I was
mostly curious whether I could see what's so obviously wrong with the init
container approach).

Regarding "what if the job manager is down?": in that case, or if the job
manager is restarting (so I can't create a savepoint anyway), the only way
to upgrade would be, as you suggested, via externalized checkpoints.
Otherwise, the only option would be to wait with the upgrade until the job
manager stops restarting (if an external dependency is causing it) and
resume from the checkpoint. A rough sketch of that checkpoint fallback is
below.
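Roughly, that fallback could look something like this (an untested sketch
in Python; it assumes retained/externalized checkpoints are enabled, that
state.checkpoints.dir points at a filesystem path the init container can
read such as a mounted volume -- for object stores one would list via the
storage client instead -- and the usual
<checkpoints-dir>/<job-id>/chk-<n>/_metadata layout; the function name is
made up):

import os
import re
from typing import Optional

def latest_retained_checkpoint(checkpoints_dir: str, job_id: str) -> Optional[str]:
    """Return the newest completed retained checkpoint for a job, or None."""
    job_dir = os.path.join(checkpoints_dir, job_id)
    if not os.path.isdir(job_dir):
        return None
    best_n, best_path = -1, None
    for name in os.listdir(job_dir):
        match = re.fullmatch(r"chk-(\d+)", name)
        path = os.path.join(job_dir, name)
        # Only completed checkpoints contain a _metadata file.
        if match and os.path.isfile(os.path.join(path, "_metadata")):
            n = int(match.group(1))
            if n > best_n:
                best_n, best_path = n, path
    return best_path

The resulting path could then be passed to the new job the same way as a
savepoint, e.g. with "flink run -s <path>".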

The complexity of the job manager being in restarting mode is something I'd
prefer not to handle in an init container; AFAIK, if the job is restarting,
we shouldn't even try to upgrade (or we could, if we are okay with losing
the state). An operator sounds like a much saner way to handle this. Still,
for completeness, the sketch below shows how the init container could guard
against it.
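For reference, here is roughly how I picture the init container step
against the job manager's REST API (an untested sketch in Python; the
service name, savepoint directory, and hand-off file are made-up
placeholders, and real code would need retries and proper failure
handling):

import sys
import time
import requests

JM_URL = "http://flink-jobmanager-old:8081"   # hypothetical service for the old JM
SAVEPOINT_DIR = "s3://my-bucket/savepoints"   # hypothetical target directory
HANDOFF_FILE = "/handoff/savepoint-path"      # emptyDir volume shared with main container

def main():
    try:
        jobs = requests.get(JM_URL + "/jobs", timeout=10).json()["jobs"]
    except requests.RequestException:
        # No previous job manager reachable: nothing to cancel, start fresh
        # (or fall back to a retained checkpoint, as discussed above).
        sys.exit(0)

    running = [j for j in jobs if j["status"] == "RUNNING"]
    if not running:
        # Job is RESTARTING / FAILING / ...: do not attempt the upgrade here.
        sys.exit(1)
    job_id = running[0]["id"]

    # Trigger "cancel with savepoint" and remember the async operation id.
    trigger = requests.post(
        JM_URL + "/jobs/" + job_id + "/savepoints",
        json={"target-directory": SAVEPOINT_DIR, "cancel-job": True},
        timeout=10,
    ).json()["request-id"]

    # Poll until the savepoint completes, then hand the path to the main container.
    while True:
        result = requests.get(
            JM_URL + "/jobs/" + job_id + "/savepoints/" + trigger, timeout=10
        ).json()
        if result["status"]["id"] == "COMPLETED":
            operation = result["operation"]
            if "failure-cause" in operation:
                sys.exit(1)
            with open(HANDOFF_FILE, "w") as f:
                f.write(operation["location"])
            return
        time.sleep(2)

if __name__ == "__main__":
    main()

The main container could then check for the hand-off file and, if it
exists, start the job with "flink run -s $(cat /handoff/savepoint-path)
..." so that it restores from that savepoint.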

On Thu, 30 Apr 2020 at 17:59, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Barisa,
>
> from what you've described I believe it could work, but I have never
> tried it out myself. Maybe you could report back once you have; it would
> be interesting to hear your experience with this approach.
>
> One thing to note is that the approach hinges on the fact that the older
> JobManager is still running. If for whatever reason the old JobManager
> fails shortly before the new one comes up, then you might not execute the
> job you want to upgrade. You could mitigate the problem by using
> externalized checkpoints [1] but then you would fall back to an earlier
> point.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
>
> Cheers,
> Till
>
> On Thu, Apr 30, 2020 at 3:38 PM Alexander Fedulov <alexan...@ververica.com>
> wrote:
>
>> Hi Barisa,
>>
>> it seems that there is no immediate answer to your concrete question
>> here, so I wanted to ask you a more general question in return: did you
>> consider using the Community Edition of Ververica Platform for your
>> purposes [1]
>> <https://www.ververica.com/blog/announcing-ververica-platform-community-edition>?
>> It comes with complete lifecycle management for Flink jobs on K8s. It
>> also exposes a full REST API for integration into CI/CD workflows, so if
>> you do not need the UI, you can just ignore it. The Community Edition is
>> permanently free for commercial use at any scale.
>>
>> I see that you are already using Helm, so installation could be very
>> straightforward [2] <https://www.ververica.com/getting-started>.
>> Here is the documentation with a bit more comprehensive "Getting started"
>> guide [3] <https://docs.ververica.com/getting_started/index.html>.
>>
>> [1]
>> https://www.ververica.com/blog/announcing-ververica-platform-community-edition
>> [2] https://www.ververica.com/getting-started
>> [3] https://docs.ververica.com/getting_started/index.html
>>
>> Best regards,
>>
>> --
>>
>> Alexander Fedulov | Solutions Architect
>>
>> +49 1514 6265796
>>
>> <https://www.ververica.com/>
>>
>> Follow us @VervericaData
>>
>> --
>>
>> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
>> Conference
>>
>> Stream Processing | Event Driven | Real Time
>>
>> --
>>
>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>>
>> --
>>
>> Ververica GmbH
>> Registered at Amtsgericht Charlottenburg: HRB 158244 B
>> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
>> (Tony) Cheng
>>
>>
>>
>> On Wed, Apr 29, 2020 at 5:32 PM Barisa Obradovic <bbaj...@gmail.com>
>> wrote:
>>
>>> Hi, we are attempting to migrate our Flink cluster to K8s and are
>>> looking into options for automating job upgrades; I'm wondering if
>>> anyone here has done it with an init container, or if there is a
>>> simpler way?
>>>
>>> 0: So, let's assume we have a job manager with a few task managers
>>> running in a stateful set, managed with Helm.
>>>
>>> 1: A new Helm chart is published, and Helm attempts the upgrade.
>>> Since it's a stateful set, the new version of the job manager and task
>>> managers is started while the old one is still running.
>>> 2: In the job manager pod, there is an init container whose purpose is
>>> to find the currently running job manager with the previous version of
>>> the job (either via ZooKeeper or a Kubernetes service which points to
>>> the currently running job manager). After it finds it, it runs cancel
>>> with savepoint using the Flink CLI and passes the savepoint URL via a
>>> volume to the main container.
>>> 3: The job manager container starts, finds the savepoint, and restores
>>> the new version of the job with the state from the savepoint.
>>> 4: The new pods pass health checks, so the old pods are destroyed by
>>> Kubernetes.
>>>
>>>
>>> What happens if there is no previous job manager running? The init
>>> container detects that and simply exits without doing any other work.
>>>
>>>
>>>
>>>
>>>
>>> Caveat:
>>> Most of the solutions I noticed use operators, which feel quite a bit
>>> more complex; yet since I haven't found any solution using an init
>>> container, I'm guessing I'm missing something, I just can't figure out
>>> what.
>>>
>>>
>>>
>>>
>>
