Thanks all: 1) for now, we will try with in-house Kubernetes and see how it goes. 2) Till, cheers, I'll give it a stab, though I'll likely end up with an operator or some other workflow tool (I've gotten multiple weird looks when I mentioned the init container approach at work; at this point I was mostly curious whether I could see what's so obviously wrong with the init container approach).
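
To make it concrete, here is a rough, untested sketch of what I imagine the init container doing: find the old job via the flink CLI, cancel it with a savepoint, and hand the savepoint path to the main container over a shared volume. The service name, savepoint directory, handoff file and the parsing of the CLI output are all placeholders and assumptions on my side, not anything Flink prescribes:

#!/usr/bin/env python3
"""Sketch of the init-container step: cancel the old job with a savepoint
and hand the savepoint path to the main container via a shared volume."""
import re
import subprocess
import sys

# Placeholders -- a Service that still resolves to the *old* JobManager,
# a savepoint target directory, and a file on an emptyDir volume that is
# also mounted into the main container.
OLD_JM = "flink-jobmanager-old:8081"
SAVEPOINT_DIR = "s3://my-bucket/savepoints"
HANDOFF_FILE = "/shared/savepoint-path"


def flink(action, *args):
    """Run the flink CLI against the old JobManager and return its stdout."""
    cmd = ["flink", action, "-m", OLD_JM, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


try:
    running = flink("list", "-r")
except (subprocess.CalledProcessError, FileNotFoundError):
    sys.exit(0)  # no reachable old JobManager -> nothing to migrate, just exit

job_ids = re.findall(r"\b[0-9a-f]{32}\b", running)  # Flink job IDs are 32 hex chars
if not job_ids:
    sys.exit(0)  # old JobManager is up but runs no job

# Cancel with savepoint; the CLI prints the savepoint location on success
# (exact wording may differ between Flink versions, so parse defensively).
out = flink("cancel", "-s", SAVEPOINT_DIR, job_ids[0])
match = re.search(r"(s3://\S+|hdfs://\S+|file:/\S+)", out)
if not match:
    sys.exit("could not find savepoint path in CLI output:\n" + out)

with open(HANDOFF_FILE, "w") as f:
    f.write(match.group(1).rstrip("."))  # strip a possible trailing period

The main container entrypoint would then read that file and resume from it; I've put a matching sketch for that part at the bottom of this mail.
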
Regarding: what if the job manager is down? Well, in that case, or if the job manager is restarting (so I can't create a savepoint anyway), the only way to upgrade would be, as you suggested, externalized checkpoints. Otherwise, the only option would be to wait to start the upgrade until the job manager stops restarting (if it's an external dependency that is causing it), and resume from the checkpoint. The complexity of the job manager being in restarting mode is something I'd prefer not to handle in an init container; afaik, if the job is restarting, we shouldn't even try to do the upgrade (or we could, if we are okay with losing the state). An operator sounds like a much saner way to handle this.

On Thu, 30 Apr 2020 at 17:59, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Barisa,
>
> from what you've described I believe it could work. But I never tried it out. Maybe you could report back once you have tried it; I believe it would be interesting to hear your experience with this approach.
>
> One thing to note is that the approach hinges on the fact that the older JobManager is still running. If for whatever reason the old JobManager fails shortly before the new one comes up, then you might not execute the job you want to upgrade. You could mitigate the problem by using externalized checkpoints [1], but then you would fall back to an earlier point.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
>
> Cheers,
> Till
>
> On Thu, Apr 30, 2020 at 3:38 PM Alexander Fedulov <alexan...@ververica.com> wrote:
>
>> Hi Barisa,
>>
>> it seems that there is no immediate answer to your concrete question here, so I wanted to ask you back a more general question: did you consider using the Community Edition of Ververica Platform for your purposes [1]? It comes with complete lifecycle management for Flink jobs on K8s. It also exposes a full REST API for integrating into CI/CD workflows, so if you do not need the UI, you can just ignore it. The Community Edition is permanently free for commercial use at any scale.
>>
>> I see that you are already using Helm, so installation could be very straightforward [2]. Here is the documentation with a more comprehensive "Getting started" guide [3].
>>
>> [1] https://www.ververica.com/blog/announcing-ververica-platform-community-edition
>> [2] https://www.ververica.com/getting-started
>> [3] https://docs.ververica.com/getting_started/index.html
>>
>> Best regards,
>>
>> Alexander Fedulov | Solutions Architect
>>
>> On Wed, Apr 29, 2020 at 5:32 PM Barisa Obradovic <bbaj...@gmail.com> wrote:
>>
>>> Hi, we are attempting to migrate our Flink cluster to K8s and are looking into options for automating job upgrades; I'm wondering if anyone here has done it with an init container? Or if there is a simpler way?
>>>
>>> 0: So, let's assume we have a job manager with a few task managers running in a stateful set, managed with Helm.
>>>
>>> 1: A new Helm chart is published, and Helm attempts the upgrade. Since it's a stateful set, the new version of the job manager and task manager is started even while the old one is still running.
>>> 2: In the job manager pod there is an init container, whose purpose is to find the currently running job manager with the previous version of the job (either via ZooKeeper or a Kubernetes service which points to the currently running job manager). After it finds it, it runs cancel with savepoint using the Flink CLI and passes the savepoint URL via a volume to the main container.
>>> 3: The job manager container starts, finds the savepoint, and restores the new version of the job with the state from the savepoint.
>>> 4: The new pods pass the healthchecks, so the old pods are destroyed by Kubernetes.
>>>
>>> What happens if there is no previous job manager running? The init container sees that and just exits without doing any other work.
>>>
>>> Caveat:
>>> Most of the solutions I noticed were using operators, which feel quite a bit more complex; yet since I haven't found any solution using an init container, I'm guessing I'm missing something, I just can't figure out what?
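
For step 3 in the workflow quoted above, the counterpart on the main container side could be as simple as the sketch below, assuming the job is submitted with flink run against the freshly started cluster; the jar path and handoff file are placeholders, and if you use a job-cluster image the savepoint path would presumably be passed to the cluster entrypoint instead:

#!/usr/bin/env python3
"""Sketch of the main-container side: if the init container left a savepoint
path behind, resume the new job version from it, otherwise start fresh."""
import os
import subprocess

HANDOFF_FILE = "/shared/savepoint-path"   # same emptyDir volume as the init container
JOB_JAR = "/opt/flink/usrlib/my-job.jar"  # placeholder for the new job artifact

cmd = ["flink", "run", "-d"]
if os.path.isfile(HANDOFF_FILE):
    with open(HANDOFF_FILE) as f:
        savepoint = f.read().strip()
    if savepoint:
        # -s / --fromSavepoint restores the job state from the given savepoint
        cmd += ["-s", savepoint]
cmd.append(JOB_JAR)

subprocess.run(cmd, check=True)

Again, completely untested, just to illustrate what I meant by the init container approach.
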