Hi community,
*General setup*
We are running a Flink standalone job on k8s. We start our job manager and task manager with the job jar right away, using the following command:

> /docker-entrypoint.sh standalone-job --host $1 --fromSavepoint /opt/flink/shared/savepoints/${SAVEPOINT}/ --allowNonRestoredState $2

(meaning our task/job manager pods are combined with our job/service and do not run separately).

We are using a k8s preStop hook script (prestop.sh) that takes a savepoint before our job manager is stopped:

#!/bin/bash
/opt/flink/bin/flink stop --savepointPath /opt/flink/shared/savepoints/ 000000006e6b13320000000000000000

Thus, when we deploy our job or remove the job manager pod, a savepoint is created, and after start-up our Flink service recovers from that savepoint; in case our app is restarted, the job recovers from the checkpoint instead.

Remark - a few days ago we moved to Flink 1.17, but the problem also exists on Flink 1.16.

*Test scenario*
Run traffic incoming from Kafka into our service, which is supposed to write a file. Remove the task and job manager pods before the service is able to finish writing the file, and verify that after Flink recovery the service completes its work, i.e. the file is completely written.

*Problem encountered*
The test scenario works perfectly when recovering from a checkpoint. When trying to recover from a savepoint, the service loads from the savepoint without any error, but the state comes back with a null value:

state = transcodingState.value();
if (state == null) {
    log.info("unable to pull state, creating new");
    state = new TranscodingState();
    transcodingState.update(state);
}

The log line above is written, meaning the state is null.

We also tried changing the recovery command to use -s instead of --fromSavepoint, but the result was the same.

I would appreciate it if someone could assist or come up with any idea.

Best Regards
Ariel
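
P.S. For reference, a minimal sketch of how the state above is declared and read inside a keyed function. This is illustrative only - the key/element types, the descriptor name "transcodingState", and the TranscodingState fields are placeholders, not our exact code:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TranscodingFunction extends KeyedProcessFunction<String, String, String> {

    // Minimal POJO standing in for the real state class.
    public static class TranscodingState {
        public long bytesWritten;
    }

    private transient ValueState<TranscodingState> transcodingState;

    @Override
    public void open(Configuration parameters) throws Exception {
        // The descriptor name and type are the same ones that were in place
        // when the savepoint was taken, and the operator keeps the same uid.
        transcodingState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("transcodingState", TranscodingState.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        TranscodingState state = transcodingState.value();
        if (state == null) {
            // This is the branch we hit after restoring from the savepoint.
            state = new TranscodingState();
            transcodingState.update(state);
        }
        out.collect(value);
    }
}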