Hi Yu'an,
We use flink 1.11.1. This version has a 'cancel' option in the CLI (
https://nightlies.apache.org/flink/flink-docs-release-1.11/ops/cli.html)
So, we do flink cancel -s <savepoint location> <jobId>. We have had
innumerable 'job cancels'  during deployments and we have never seen
anything like the sequence above. So, it's very odd.

Thanks
Sudharsan


On Sun, Jun 19, 2022 at 2:22 AM yu'an huang <[email protected]> wrote:

> Hi Sudharsan,
>
> How did you cancel thus single job. According to the High Availability
> Document:
>
> “In order to recover submitted jobs, Flink persists metadata and the job
> artifacts. The HA data will be kept until the respective job either
> succeeds, is cancelled or fails terminally. Once this happens, all the HA
> data, including the metadata stored in the HA services, will be deleted."
>
> So I think the job data should be deleted if you use the action “cancel”
> (instead of “stop") to cancel the job. Also I paste the HA and savepoint
> doc link below, hopes these may help you.
> HA:
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/ha/overview/
> Savepoint:
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/
>
>
> Best,
> Yuan
>
>
>
> On 19 Jun 2022, at 12:51 AM, Sudharsan R <[email protected]> wrote:
>
> Hello,
> We are running a single job in a flink 1.11.1 cluster on a k8s cluster. We
> use zookeeper HA mode.
>
> To upgrade our application code, we do a flink cli job cancel with
> savepoint. We then bring down the whole flink cluster. We bring it back up
> and submit the new app code with this savepoint.
>
> Here's a specific scenario:
> 1. A checkpoint was initiated by the flink infra.
> 2. We triggered a cancel with savepoint while the checkpoint was in
> progress.
> 3. Based on logs, the checkpoint completes and immediately after this the
> savepoint also seems to complete. At this point, my expectation is that
> zookeeper would have no state for this job on this cluster.
> 4. The new cluster comes up. We submit a job from our savepoint. However,
> the old job also seems to have been recovered! The UI shows this job. The
> logs also seem to indicate this.
> Please see a list of interesting events:
> 21:09:28 Starting job 2ddc7c290891ec2d169068d1992586d4 from savepoint …….
> Jun 17, 2022 @ 21:09:25.036 Submitting Job with
> JobId=2ddc7c290891ec2d169068d1992586d4.
> 21:08:27 Recovered JobGraph(jobId: 28e0ef806b40c27111614081e18d72f9)
> 21:08:27 Successfully recovered 1 persisted job graphs.
> 21:07:27 Starting standalonesession dameon on ….
> 21:07:25 New jobmanager pod comes up
>
> 21:07:14 Last message seen from old manager job
> 21:07:00 Cancelling tasks to cancelled messages
> 21:06:42 savepoint stored in ….
> 21:05:16 Last message of type Received last message for now expired
> checkpoint attempt 101289
> 21:04:52 Received late message for now expired checkpoint attempt 101289 ….
> 21:04:49 Triggering checkpoint 101290 (type=SAVEPOINT)
> 21:04:48: ERROR
> org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy:
> Could not properly discard states.
> 21:04:48 ERROR
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory: Could
> not delete the checkpoint stream file
> 21:04:47 Submitting Job with JobId=2ddc7c290891ec2d169068d1992586d4.
>
> 21:04:37 Triggering checkpoint 101289 (type=CHECKPOINT)
>
> I don't see any zookeeper errors around this time(server or flink logs).
> The ERROR events(21:04:48) are interesting. Although, it's much before the
> savepoint completion (21:06:42).
>
> What if anything could i be possibly doing wrong? We could try to clean
> out the zookeeper state prior to job submission as a safety measure. But, i
> would have expected this to work neverthless.
>
> Thanks
> Sudharsan
>
>
>

Reply via email to