[ https://issues.apache.org/jira/browse/FLINK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497438#comment-17497438 ]
Konstantin Knauf commented on FLINK-26273:
------------------------------------------

[~dwysakowicz] Looks very good. Here's what I did. Did I miss anything from your perspective?

I opened two tickets (linked to this ticket) with CLI related issues. I also opened a hotfix PR with some documentation improvements: https://github.com/apache/flink/pull/18909

*Commit:* 0ff6f2cc78f

1. *Taking Canonical/Native Savepoints/Retained Checkpoint*

Ran TopSpeedWindowing three times in Standalone Application Mode with RocksDB (incremental) and checkpoints retained on cancellation. The three runs ended as follows:

* stopped with native savepoint (Savepoint ID: savepoint-b6dea9-ec57dcec988e)
* stopped with canonical savepoint (Savepoint ID: savepoint-0d0bb8-3cfceefe4dec)
* cancelled (JobID: c40b0839cfa6a454919597819e8e84f6)

*Checkpoint Directory*
{noformat}
/tmp/flink-checkpoints
├── 0d0bb8faccf2eb8124d086a5355428a8
│   ├── shared
│   └── taskowned
├── b6dea9642f5159f83c32eca3fc40082a
│   ├── shared
│   └── taskowned
└── c40b0839cfa6a454919597819e8e84f6
    ├── chk-13
    │   └── _metadata
    ├── shared
    │   └── 1d438c44-c7a6-49c0-8053-1e5689a6df5c
    └── taskowned
{noformat}

*Savepoint Directory*
{noformat}
/tmp/flink-savepoints
├── savepoint-0d0bb8-3cfceefe4dec
│   └── _metadata
└── savepoint-b6dea9-ec57dcec988e
    ├── dd200786-54e3-4af3-a6f4-2943ff73bc14
    └── _metadata
{noformat}

2. *Two Jobs can be Started from Native Savepoint without Claiming and take a full checkpoint*

Started two TopSpeedWindowing jobs (aca6b1fc37c489d608b8ab9d562cd569 & 634d99afcf280d7e6eefd7d9f2b0ec37) from the native savepoint without claiming it and confirmed that both took a full snapshot. As evidence, I used the fact that "Checkpointed Data Size" equals "Full Checkpoint Data Size" for the first checkpoint only. Cancelled both jobs.

3. *Two Jobs can be Started from Retained Checkpoint without Claiming and take a full checkpoint*

Like Step 2, just using the retained checkpoint from Step 1 instead of the native savepoint.

4. *Job can claim retained checkpoint and continues to checkpoint incrementally, retained checkpoint is cleaned up*

Started TopSpeedWindowing with claiming from the retained checkpoint of Step 1. Confirmed that the first checkpoint is incremental and that the original checkpoint directory is empty after a few checkpoints.

{code:bash}
/tmp/flink-checkpoints/c40b0839cfa6a454919597819e8e84f6
├── shared
└── taskowned
{code}

5. *Job can claim moved native savepoint and continues to checkpoint incrementally, savepoint is cleaned up*

Copied the native savepoint from Step 1 to a different directory. Everything else as in Step 4. The first checkpoint is incremental, and the directory of the moved savepoint no longer exists after a few checkpoints.

6. *Native Savepoint can be removed after first successful checkpoint and recovery still works*

Started TopSpeedWindowing from the native savepoint. After one checkpoint, removed the savepoint, killed the TaskManager and restarted it. The job recovered and continued checkpointing.

> Test checkpoints restore modes & formats
> ----------------------------------------
>
>                 Key: FLINK-26273
>                 URL: https://issues.apache.org/jira/browse/FLINK-26273
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Dawid Wysakowicz
>            Assignee: Konstantin Knauf
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>
> We should test manually changes introduced in [FLINK-25276] & [FLINK-25154]
> Proposal:
> Take canonical savepoint/native savepoint/externalised checkpoint (with
> RocksDB), and perform claim (1)/no claim (2) recoveries, and verify that in:
> # after a couple of checkpoints claimed files have been cleaned up
> # that after a single successful checkpoint, you can remove the start up
> files and failover the job without any errors.
> # take a native, incremental RocksDB savepoint, move to a different
> directory, restore from it
> documentation:
> # https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#restore-mode
> # https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#savepoint-format



--
This message was sent by Atlassian Jira
(v8.20.1#820001)