[jira] [Commented] (FLINK-37069) Cross-team verification for "Disaggregated State Management"

Weijie Guo (Jira) Mon, 17 Feb 2025 00:12:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-37069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927631#comment-17927631
 ]


Weijie Guo commented on FLINK-37069:
------------------------------------

Hi [~Zakelly], I have tested this following the instructions.

1. Check out and compile Flink at commit hash dd4bd434
2. Start a standalone Flink cluster
3. Set `execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION` in the Flink configuration
4. Run the Flink example

{code:bash}
./bin/flink run ./examples/streaming/StateMachineExample.jar \
  --backend forst \
  --checkpoint-dir file:///cp \
  --incremental-checkpoints true
{code}

5. Confirm that a checkpoint is triggered and completed, then cancel the job
6. Restart from the latest checkpoint

{code:bash}
./bin/flink run -s file:///cp/ac252d10cfd0e70bc1142557f08132f4/chk-8 \
  ./examples/streaming/StateMachineExample.jar \
  --backend forst \
  --checkpoint-dir file:///cp \
  --incremental-checkpoints true
{code}

But the job failed with the following exception:

{code:java}
Caused by: java.lang.IllegalArgumentException: Unsupported sharing files strategy for org.apache.flink.state.forst.snapshot.ForStIncrementalSnapshotStrategy : FORWARD
        at org.apache.flink.state.forst.snapshot.ForStIncrementalSnapshotStrategy.asyncSnapshot(ForStIncrementalSnapshotStrategy.java:146) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
        at org.apache.flink.state.forst.snapshot.ForStIncrementalSnapshotStrategy.asyncSnapshot(ForStIncrementalSnapshotStrategy.java:70) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
        at org.apache.flink.runtime.state.SnapshotStrategyRunner.snapshot(SnapshotStrategyRunner.java:80) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
        at org.apache.flink.state.forst.ForStKeyedStateBackend.snapshot(ForStKeyedStateBackend.java:484) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:281) ~[flink-dist-2.0-SNAPSHOT.jar:2.0-SNAPSHOT]
{code}
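
For anyone reproducing step 6, the path passed to `-s` is the highest-numbered `chk-N` directory under the job's checkpoint directory. Below is a minimal sketch of a helper that finds it. It assumes a local `file://` checkpoint directory laid out as `<checkpoint-dir>/<job-id>/chk-<n>` (as in the path above) and treats a checkpoint as complete only if the directory contains a `_metadata` file. The function name `latest_checkpoint` is hypothetical and not part of Flink.

```shell
# Hypothetical helper (not part of Flink): print the highest-numbered
# chk-N directory under a job's checkpoint directory.
# Assumption: a completed checkpoint directory contains a `_metadata` file.
latest_checkpoint() {
  job_dir="$1"
  best=""
  best_n=-1
  for d in "$job_dir"/chk-*; do
    # Skip the unexpanded glob and anything that is not a directory.
    [ -d "$d" ] || continue
    # Skip checkpoints that have no metadata file (treated as incomplete).
    [ -f "$d/_metadata" ] || continue
    # Extract the numeric suffix after the final "chk-".
    n="${d##*/chk-}"
    case "$n" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$n" -gt "$best_n" ]; then
      best_n="$n"
      best="$d"
    fi
  done
  if [ -n "$best" ]; then
    printf '%s\n' "$best"
  else
    return 1
  fi
}

# Example (not executed here):
#   ./bin/flink run -s "file://$(latest_checkpoint /cp/ac252d10cfd0e70bc1142557f08132f4)" ...
```

The loop compares the numeric suffix instead of sorting directory names, so `chk-10` is ranked above `chk-8`.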




> Cross-team verification for "Disaggregated State Management"
> ------------------------------------------------------------
>
>                 Key: FLINK-37069
>                 URL: https://issues.apache.org/jira/browse/FLINK-37069
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Xintong Song
>            Assignee: Weijie Guo
>            Priority: Blocker
>             Fix For: 2.0.0
>
>
> Instructions:
> First of all, please read the related documents briefly (still under review; 
> they will be replaced with formal links once merged):
>  * Disaggregated State Management: 
> [https://github.com/apache/flink/pull/26107/files#diff-bfa19e04bb5c3487c3e9bf514d61c0fa8bb973950fb0ad0e3d4a6898a99b83e3]
>  * State V2: 
> [https://github.com/apache/flink/pull/26107/files#diff-5d1147987fecbda329132403c1d92384575be220092995c4be491e12b8c50cc9]
>  * ForSt State Backend: 
> [https://github.com/apache/flink/pull/26107/files#diff-b7c52c06f6ed4d5af6f230d11ba23ea051bf4a08c589d98392143f080c468a87]
> SQL verification is tracked in FLINK-37068; here we mainly focus on 
> DataStream jobs and APIs.
> 1. Make sure you are verifying this on the release-2.0 branch, since we have 
> fixed several bugs since the rc0 package.
> 2. Choose one example in `flink-examples-streaming`. Most of the jobs have 
> been rewritten using the new API. Here we take `StateMachineExample` as an 
> example.
> 3. Compile and run `StateMachineExample` in a suitable environment (I suggest a 
> standalone session cluster or YARN), and make sure you pass the following 
> command-line parameters:
> {code:bash}
> ./flink run xxxxxxxxx \
>   --backend forst \
>   --checkpoint-dir s3://your/cp/dir \
>   --incremental-checkpoints true
> {code}
> Or set them via `config.yaml`:
> {code:yaml}
> state.backend.type: forst
> execution.checkpointing.incremental: true
> execution.checkpointing.dir: s3://your-bucket/flink-checkpoints
> {code}
> 4. Check that the job is running smoothly and that periodic checkpoints are 
> taken successfully.
> 5. Stop the job and restart from the latest checkpoint.
> It would be great if you could write your own job using the State V2 API and 
> follow Steps 3 to 5 above. It is important to check whether there are any bugs 
> in the new State APIs.



