hi all--we've run into a gap (knowledge? design? tbd?) for our use cases when deploying Flink jobs to start from savepoints using the job-cluster mode in Kubernetes.
we're running ~15 different jobs, all in job-cluster mode, using a mix of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these are all long-running streaming jobs, all essentially acting as microservices. we're using Helm charts to configure all of our deployments.

we have a number of use cases where we want to restart jobs from a savepoint to replay recent events, e.g. when we've enhanced the job logic or fixed a bug. but after the deployment we want the job to resume its "long-running" behavior, where any unplanned restarts resume from the latest checkpoint.

the issue we run into is that any obvious/standard/idiomatic Kubernetes deployment includes the savepoint argument in the configuration. if the Job Manager container(s) have an unplanned restart, when they come back up they will start from the savepoint instead of resuming from the latest checkpoint. everything is working as configured, but that's not exactly what we want. we want the savepoint argument to be transient somehow (only used during the initial deployment), but Kubernetes doesn't really support the concept of transient configuration.

i can see a couple of potential solutions that involve either custom code in the jobs or custom logic in the container (e.g. a custom entrypoint script that records, in a file on a persistent volume or GCS, that the configured savepoint has already been used, and potentially when/why/by which deployment). but these seem like unexpected and hacky solutions.

before we head down that road i wanted to ask:

- is this already a solved problem that i've missed?
- is this issue already on the community's radar?

thanks in advance!

--
*Sean Hester* | Senior Staff Software Engineer
<http://www.bettercloud.com>
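
p.s. to make the "custom entrypoint" idea a bit more concrete, here's a rough sketch in Python of the kind of wrapper logic i have in mind. it is not something we run today. the names `build_job_args` and `marker_path`, the marker file naming, and the marker directory are all made up for illustration, and i'm assuming the job-cluster entrypoint takes the savepoint via `--fromSavepoint`. the marker directory would need to live on storage that survives container restarts (a persistent volume in this sketch).

```python
import hashlib
import json
import os
import time


def marker_path(marker_dir, savepoint_path):
    """Return the (hypothetical) marker file location for one savepoint path."""
    # Hash the savepoint path so it can be used safely as a file name.
    digest = hashlib.sha256(savepoint_path.encode("utf-8")).hexdigest()[:16]
    return os.path.join(marker_dir, f"savepoint-{digest}.used")


def build_job_args(base_args, savepoint_path, marker_dir):
    """Append the savepoint argument only the first time a savepoint is seen.

    base_args      -- arguments always passed to the job-cluster entrypoint
    savepoint_path -- savepoint from the deployment config, or None/empty
    marker_dir     -- directory on storage that survives container restarts
    """
    args = list(base_args)
    if not savepoint_path:
        return args

    marker = marker_path(marker_dir, savepoint_path)
    if os.path.exists(marker):
        # This savepoint was already consumed by an earlier start, so leave
        # the argument out and let the job recover from its latest checkpoint.
        return args

    # First start with this savepoint: record that it has been used.
    os.makedirs(marker_dir, exist_ok=True)
    with open(marker, "w", encoding="utf-8") as f:
        json.dump({"savepoint": savepoint_path, "used_at": time.time()}, f)

    # Assumption: the entrypoint accepts the savepoint as --fromSavepoint.
    return args + ["--fromSavepoint", savepoint_path]
```

the container entrypoint would call `build_job_args(...)` with the configured savepoint and then exec the normal job-cluster command with the returned arguments. one known weakness: the marker is written before the job has actually restored, so a crash in between would skip the savepoint on the next start. deploying a new savepoint path produces a new marker, so the next planned deployment replays as expected.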