[ https://issues.apache.org/jira/browse/FLINK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078057#comment-17078057 ]

Biao Liu edited comment on FLINK-16423 at 4/8/20, 10:56 AM:
------------------------------------------------------------

Thanks [~rmetzger] for the deep analysis. I checked the attached log. I 
believe the scenario is the same as FLINK-16770. The problem happens while 
checkpoint 9 is being finalized. We can see that the 
{{CheckpointCoordinator}} tried to recover from checkpoint 9, so checkpoint 
9 must have been added to the {{CompletedCheckpointStore}}. However, we can't 
find the log line "Completed checkpoint 9 ...". It must have failed after 
being added to the {{CompletedCheckpointStore}}, e.g. it was aborted due to 
the "artificial failure". Regarding "where are checkpoints 6, 7 and 8": since 
we only keep 1 successful checkpoint in the {{CompletedCheckpointStore}}, 
they must have been subsumed when checkpoint 9 was added.

The work-around fix from FLINK-16770 so far is to keep 2 successful 
checkpoints in the {{CompletedCheckpointStore}} for these cases. So even if 
checkpoint 9 never finishes finalization, at least checkpoint 8 should 
still exist.
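
To make the failure mode concrete, here is a toy sketch of the subsume-on-add behavior and why retaining 2 checkpoints helps. This is not Flink's actual {{CompletedCheckpointStore}} implementation; the class and method names below are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the subsume-on-add behavior described above. NOT Flink's
// actual CompletedCheckpointStore API; names here are illustrative only.
public class ToyCheckpointStore {
    private final int maxRetained; // 1 today; the work-around raises it to 2
    private final Deque<Integer> completed = new ArrayDeque<>();

    public ToyCheckpointStore(int maxRetained) {
        this.maxRetained = maxRetained;
    }

    // Adding a completed checkpoint subsumes (discards) the oldest ones
    // once more than maxRetained are stored.
    public void add(int checkpointId) {
        completed.addLast(checkpointId);
        while (completed.size() > maxRetained) {
            completed.removeFirst(); // older checkpoint is subsumed
        }
    }

    // Simulates the latest checkpoint failing after it was already added
    // (e.g. aborted during finalization): recovery must fall back to
    // whatever is still retained, if anything.
    public Integer latestAfterDropping(int failedId) {
        completed.removeFirstOccurrence(failedId);
        return completed.peekLast();
    }

    public static void main(String[] args) {
        ToyCheckpointStore one = new ToyCheckpointStore(1);
        for (int id = 6; id <= 9; id++) one.add(id); // 6, 7, 8 subsumed by 9
        System.out.println(one.latestAfterDropping(9)); // null: nothing left

        ToyCheckpointStore two = new ToyCheckpointStore(2);
        for (int id = 6; id <= 9; id++) two.add(id); // keeps 8 and 9
        System.out.println(two.latestAfterDropping(9)); // 8: recovery possible
    }
}
```

With {{maxRetained}} = 1, adding checkpoint 9 subsumes 6, 7 and 8, so if 9 then fails there is nothing left to recover from; with {{maxRetained}} = 2, checkpoint 8 survives.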

If the test gets stuck quite frequently, we could apply the same work-around 
to this case. However, this bug has to be fixed completely before releasing 1.11.



> test_ha_per_job_cluster_datastream.sh gets stuck
> ------------------------------------------------
>
>                 Key: FLINK-16423
>                 URL: https://issues.apache.org/jira/browse/FLINK-16423
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Tests
>    Affects Versions: 1.11.0
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>            Priority: Blocker
>         Attachments: 20200408.1.tgz
>
>
> This was seen in 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=5905&view=logs&j=b1623ac9-0979-5b0d-2e5e-1377d695c991&t=e7804547-1789-5225-2bcf-269eeaa37447
>  ... the relevant part of the logs is here:
> {code}
> 2020-03-04T11:27:25.4819486Z 
> ==============================================================================
> 2020-03-04T11:27:25.4820470Z Running 'Running HA per-job cluster (rocks, 
> non-incremental) end-to-end test'
> 2020-03-04T11:27:25.4820922Z 
> ==============================================================================
> 2020-03-04T11:27:25.4840177Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-25482960156
> 2020-03-04T11:27:25.6712478Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-04T11:27:25.6830402Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-04T11:27:26.2988914Z Starting zookeeper daemon on host fv-az655.
> 2020-03-04T11:27:26.3001237Z Running on HA mode: parallelism=4, 
> backend=rocks, asyncSnapshots=true, and incremSnapshots=false.
> 2020-03-04T11:27:27.4206924Z Starting standalonejob daemon on host fv-az655.
> 2020-03-04T11:27:27.4217066Z Start 1 more task managers
> 2020-03-04T11:27:30.8412541Z Starting taskexecutor daemon on host fv-az655.
> 2020-03-04T11:27:38.1779980Z Job (00000000000000000000000000000000) is 
> running.
> 2020-03-04T11:27:38.1781375Z Running JM watchdog @ 89778
> 2020-03-04T11:27:38.1781858Z Running TM watchdog @ 89779
> 2020-03-04T11:27:38.1783272Z Waiting for text Completed checkpoint [1-9]* for 
> job 00000000000000000000000000000000 to appear 2 of times in logs...
> 2020-03-04T13:21:29.9076797Z ##[error]The operation was canceled.
> 2020-03-04T13:21:29.9094090Z ##[section]Finishing: Run e2e tests
> {code}
> The last three lines indicate that the test is waiting forever for a 
> checkpoint to appear.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
