Hello,
We are running a single job on a Flink 1.11.1 cluster deployed on
Kubernetes. We use ZooKeeper HA mode.

To upgrade our application code, we cancel the job with a savepoint using
the Flink CLI. We then bring down the whole Flink cluster, bring it back up,
and submit the new application code from that savepoint.
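
As a rough sketch of the upgrade sequence described above, the two CLI calls can be built as argument lists. This assumes the `flink cancel -s <targetDirectory> <jobId>` and `flink run -s <savepointPath> <jar>` forms of the Flink 1.11 CLI; the function names and the `flink_bin` parameter are hypothetical, and the exact flags we use may differ.

```python
def cancel_with_savepoint_cmd(job_id, savepoint_dir, flink_bin="flink"):
    """Build the 'cancel with savepoint' call: flink cancel -s <dir> <jobId>."""
    return [flink_bin, "cancel", "-s", savepoint_dir, job_id]


def run_from_savepoint_cmd(savepoint_path, jar_path, flink_bin="flink"):
    """Build the resubmission call: flink run -d -s <savepointPath> <jar>."""
    return [flink_bin, "run", "-d", "-s", savepoint_path, jar_path]


def upgrade_plan(job_id, savepoint_dir, savepoint_path, jar_path):
    """Return the ordered steps; the cluster restart happens between them."""
    return [
        cancel_with_savepoint_cmd(job_id, savepoint_dir),
        # (hypothetical) bring the Flink cluster down and back up here
        run_from_savepoint_cmd(savepoint_path, jar_path),
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)`; the savepoint path for the second step has to be taken from the output of the first.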

Here's a specific scenario:
1. Flink initiated a checkpoint.
2. We triggered a cancel with savepoint while the checkpoint was in
progress.
3. Based on the logs, the checkpoint completes and, immediately afterwards,
the savepoint also seems to complete. At this point, my expectation is that
ZooKeeper would hold no state for this job on this cluster.
4. The new cluster comes up and we submit a job from our savepoint. However,
the old job also seems to have been recovered! The UI shows this job, and
the logs seem to confirm it.

Here is a list of interesting events (newest first):

21:09:28 Starting job 2ddc7c290891ec2d169068d1992586d4 from savepoint …….

Jun 17, 2022 @ 21:09:25.036 Submitting Job with
JobId=2ddc7c290891ec2d169068d1992586d4.

21:08:27 Recovered JobGraph(jobId: 28e0ef806b40c27111614081e18d72f9)

21:08:27 Successfully recovered 1 persisted job graphs.

21:07:27 Starting standalonesession daemon on ….

21:07:25 New jobmanager pod comes up


21:07:14 Last message seen from the old jobmanager

21:07:00 Messages showing tasks going from cancelling to cancelled

21:06:42 savepoint stored in ….

21:05:16 Last message of type "Received late message for now expired
checkpoint attempt 101289"

21:04:52 Received late message for now expired checkpoint attempt 101289 ….

21:04:49 Triggering checkpoint 101290 (type=SAVEPOINT)

21:04:48 ERROR
org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy:
Could not properly discard states.

21:04:48 ERROR
org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory: Could
not delete the checkpoint stream file

21:04:47 Submitting Job with JobId=2ddc7c290891ec2d169068d1992586d4.

21:04:37 Triggering checkpoint 101289 (type=CHECKPOINT)
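
Since the events above are listed newest first, a small sketch like the following can put them in chronological order and compare the job IDs involved. The timestamps and messages are copied from a subset of the list above; the helper names are made up for illustration.

```python
import re

# A subset of the events listed above, in the order they were posted.
EVENTS = [
    ("21:09:28", "Starting job 2ddc7c290891ec2d169068d1992586d4 from savepoint"),
    ("21:08:27", "Recovered JobGraph(jobId: 28e0ef806b40c27111614081e18d72f9)"),
    ("21:06:42", "savepoint stored in"),
    ("21:04:49", "Triggering checkpoint 101290 (type=SAVEPOINT)"),
    ("21:04:37", "Triggering checkpoint 101289 (type=CHECKPOINT)"),
]

# Flink job IDs are printed as 32 lowercase hex characters.
JOB_ID = re.compile(r"\b[0-9a-f]{32}\b")


def chronological(events):
    """Sort by HH:MM:SS; valid because all events fall on the same day."""
    return sorted(events, key=lambda event: event[0])


def job_ids_by_kind(events):
    """Split job IDs into those recovered from HA state and those submitted."""
    recovered, submitted = set(), set()
    for _, message in events:
        ids = JOB_ID.findall(message)
        if message.startswith("Recovered JobGraph"):
            recovered.update(ids)
        elif message.startswith(("Starting job", "Submitting Job")):
            submitted.update(ids)
    return recovered, submitted


recovered, submitted = job_ids_by_kind(EVENTS)
```

With these events, `recovered` and `submitted` hold different IDs, which matches two jobs showing up in the UI.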


I don't see any ZooKeeper errors around this time (in either the server or
the Flink logs). The ERROR events at 21:04:48 are interesting, although they
occur well before the savepoint completes (21:06:42).


What, if anything, could I be doing wrong? We could clean out the ZooKeeper
state before job submission as a safety measure, but I would have expected
this to work nevertheless.
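
A minimal sketch of such a pre-submission check, assuming the default HA layout where job graphs live under `/flink/default/jobgraphs` (that is, `high-availability.zookeeper.path.root`, `high-availability.cluster-id`, and `high-availability.zookeeper.path.jobgraphs` are left at their defaults; our real values may differ). The `zk` argument is any client exposing `exists(path)` and `get_children(path)`, as a kazoo `KazooClient` does; the function names are hypothetical. It only lists leftovers and deliberately does not delete anything.

```python
def jobgraphs_path(root="/flink", cluster_id="/default", jobgraphs="/jobgraphs"):
    """Join the three HA settings into one znode path (defaults are assumed)."""
    parts = [part.strip("/") for part in (root, cluster_id, jobgraphs)]
    return "/" + "/".join(part for part in parts if part)


def leftover_job_ids(zk, path):
    """Return job IDs that still have a persisted job graph under `path`."""
    if not zk.exists(path):
        return []  # nothing was ever persisted for this cluster
    return sorted(zk.get_children(path))


def safe_to_submit(zk, path):
    """True only if no job graph would be recovered on startup."""
    return len(leftover_job_ids(zk, path)) == 0
```

With kazoo this might be called as `safe_to_submit(client, jobgraphs_path())` after `client.start()`, where the connection string passed to `KazooClient` is whatever our ZooKeeper quorum is.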


Thanks

Sudharsan
