My Flink job runs in Kubernetes. This is the setup:

1. One job running as a job cluster with one job manager
2. HA powered by ZooKeeper (works fine)
3. Job/Deployment manifests stored in GitHub and deployed to Kubernetes by Argo
4. State persisted to S3
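For context, the job manager manifest looks roughly like this. This is a minimal sketch, assuming the standalone job-cluster entry point of the official Flink image; the image name, job class, ZooKeeper quorum, and S3 paths are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: flink-job-cluster
spec:
  template:
    metadata:
      labels:
        app: flink-job-cluster
    spec:
      restartPolicy: OnFailure
      containers:
        - name: jobmanager
          image: my-registry/my-flink-job:latest    # placeholder; job jar baked into the image
          args:
            - job-cluster
            - --job-classname
            - com.example.MyJob                     # placeholder job class
            - -Dhigh-availability=zookeeper
            - -Dhigh-availability.zookeeper.quorum=zookeeper:2181
            - -Dhigh-availability.storageDir=s3://my-bucket/flink/ha
            - -Dstate.checkpoints.dir=s3://my-bucket/flink/checkpoints
            - -Dstate.savepoints.dir=s3://my-bucket/flink/savepoints
          ports:
            - containerPort: 6123                   # job manager RPC
            - containerPort: 8081                   # web UI / REST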
If I were to stop the job (drain and take a savepoint) and later resume it, I would have to update the job manager manifest with the savepoint location, commit it to GitHub, and redeploy. After deployment, I would presumably have to modify the manifest again to remove the savepoint location, so as to avoid starting the application from the same savepoint on a later restart (the sketch at the end of this mail shows the exact manifest change I mean). This raises some questions:

1. If the job manager were to crash before the manifest is updated again, won't Kubernetes restart the job manager from the savepoint rather than from the latest checkpoint?
2. Is there a way to ensure that restoration from a savepoint doesn't happen more than once, or at least not after the first successful checkpoint?
3. If even one checkpoint has been finalized, the job should prefer that checkpoint over the savepoint. Will that happen automatically given ZooKeeper HA?
4. Is it possible to avoid removing the savepoint path from the Kubernetes manifest and simply rely on newer checkpoints/savepoints? It feels rather clumsy to have to add it and then remove it manually. We could use a cron job to remove it, but it's still clumsy.
5. Is there a way of asking Flink to use the latest savepoint rather than specifying the exact savepoint location? If I were to manually rename the S3 savepoint location to something fixed (s3://fixed_savepoint_path_always), would there be any problem restoring the job?
6. Is there any open source tool that solves this problem?
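For reference, relative to the manifest above, the change for resuming from a savepoint is just one extra argument pair on the job manager container, which is the bit I keep having to add and then take back out. Paths are placeholders, and --fromSavepoint / --allowNonRestoredState are the standalone job-cluster entry point's restore options, if I read the docs right:

          args:
            - job-cluster
            - --job-classname
            - com.example.MyJob
            - --fromSavepoint
            - s3://my-bucket/flink/savepoints/savepoint-abc123   # placeholder path written by the drain/stop
            # - --allowNonRestoredState                          # only if the job graph changed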