Hi Weihua! Thanks so much for your answer, your explanation makes it clear that this is far from trivial. Have you (or anyone else) been able to apply something close to the Blue/Green deployment idea from this Zero Downtime Upgrade presentation: https://www.youtube.com/watch?v=Hyt3YrtKQAM ? If so, how did you manage to automate and orchestrate it (including rolling back if the new job doesn't behave properly, for example)?
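
For reference, here is roughly the orchestration I have in mind, written as a Python sketch against Flink's REST API. It is only a sketch of the idea, not something we have running: the cluster URLs, savepoint directory, jar id and "healthy" criterion are placeholders, and it assumes two separate clusters (blue and green) with the new jar already uploaded to the green one.

import time
import requests  # plain HTTP client, nothing Flink-specific

BLUE = "http://flink-blue-jobmanager:8081"    # cluster running the current jar (placeholder)
GREEN = "http://flink-green-jobmanager:8081"  # cluster that will run the new jar (placeholder)
SAVEPOINT_DIR = "s3://my-bucket/savepoints"   # placeholder

def running_job(base):
    """Id of the single RUNNING job on a cluster (we run one job per cluster)."""
    jobs = requests.get(f"{base}/jobs/overview").json()["jobs"]
    return next(j["jid"] for j in jobs if j["state"] == "RUNNING")

def take_savepoint(base, jid):
    """Trigger a savepoint on a running job (without cancelling it) and wait for its location."""
    request_id = requests.post(
        f"{base}/jobs/{jid}/savepoints",
        json={"target-directory": SAVEPOINT_DIR, "cancel-job": False},
    ).json()["request-id"]
    while True:
        info = requests.get(f"{base}/jobs/{jid}/savepoints/{request_id}").json()
        if info["status"]["id"] == "COMPLETED":
            return info["operation"]["location"]
        time.sleep(2)

def submit(base, jar_id, savepoint):
    """Start an already-uploaded jar, restoring from the given savepoint."""
    return requests.post(
        f"{base}/jars/{jar_id}/run", json={"savepointPath": savepoint}
    ).json()["jobid"]

def healthy(base, jid):
    """Placeholder health check: a real one should also look at restarts, lag, etc."""
    return requests.get(f"{base}/jobs/{jid}").json()["state"] == "RUNNING"

old_job = running_job(BLUE)
savepoint = take_savepoint(BLUE, old_job)           # the old job keeps running meanwhile
new_job = submit(GREEN, "<new-jar-id>", savepoint)  # jar id is a placeholder
time.sleep(120)                                     # grace period to restore state and catch up
if healthy(GREEN, new_job):
    requests.patch(f"{BLUE}/jobs/{old_job}", params={"mode": "cancel"})   # promote green
else:
    requests.patch(f"{GREEN}/jobs/{new_job}", params={"mode": "cancel"})  # roll back, keep blue

The part I still can't figure out is what the health check should really look at, and how to avoid the two jobs double-writing to the sinks while they overlap, which is why I'm curious how you (or others) orchestrated it.
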
On Wed, Jun 29, 2022 at 3:38 PM Weihua Hu <huweihua....@gmail.com> wrote:

> Hi, Robin
>
> This is a good question, but Flink can't do rolling upgrades.
>
> I'll try to explain the cost of supporting RollingUpgrade in Flink.
> 1. There is a shuffle connection between tasks in a Region, and in order
> to ensure the consistency of the data processed by the upstream and
> downstream tasks in the same region, these tasks need to failover/restart
> at the same time. So the cost of a RollingUpgrade for jobs with
> keyBy/hash shuffles is not much different from a full stop/start of the
> job.
> 2. To maintain correctness during a RollingUpgrade, many details need to
> be considered, such as the checkpoint coordinator, failover handling,
> and operator coordinators.
>
> But, IMO, job RollingUpgrade is a worthwhile thing to do.
>
> Best,
> Weihua
>
>
> On Wed, Jun 29, 2022 at 3:52 PM Robin Cassan <
> robin.cas...@contentsquare.com> wrote:
>
>> Hi all! We are running a Flink cluster on Kubernetes and deploying a
>> single job on it through "flink run <jar>". Whenever we want to modify
>> the jar, we cancel the job and run the "flink run" command again with
>> the new jar and the retained checkpoint URL from the first run.
>> This works well, but it adds some unavoidable downtime for each update:
>> - downtime in between the "cancel" and the "run"
>> - downtime during restoration of the state
>>
>> We aim to reduce this downtime to a minimum and are wondering if there
>> is a way to submit a new version of a job without completely stopping
>> the old one, having each TM run the new jar one at a time and waiting
>> for the state to be restored before moving on to the next TM. This
>> would limit the downtime to a single TM at a time.
>>
>> Otherwise, we could try to submit a second job and let it restore
>> before canceling the old one, but this raises complex synchronisation
>> issues, especially since the second job would be restoring the first
>> job's retained (incremental) checkpoint, which seems like a bad idea...
>>
>> What would be the best way to reduce downtime in this scenario, in your
>> opinion?
>>
>> Thanks!