[ 
https://issues.apache.org/jira/browse/FLINK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078019#comment-17078019
 ] 

Robert Metzger edited comment on FLINK-16423 at 4/8/20, 10:07 AM:
------------------------------------------------------------------

I was finally able to isolate a failure with good logs. The issue seems to be 
related to checkpointing.

Here is the output from the e2e test:
{code}
2020-04-08T09:14:35.0313320Z Starting zookeeper daemon on host fv-az32.
2020-04-08T09:14:35.1677419Z Starting HA cluster with 1 masters.
2020-04-08T09:14:35.5878680Z Starting standalonesession daemon on host fv-az32.
2020-04-08T09:14:38.0473360Z Starting taskexecutor daemon on host fv-az32.
2020-04-08T09:14:38.0871778Z Waiting for Dispatcher REST endpoint to come up...
2020-04-08T09:14:39.1349049Z Waiting for Dispatcher REST endpoint to come up...
2020-04-08T09:14:40.6058895Z Waiting for Dispatcher REST endpoint to come up...
2020-04-08T09:14:41.7923885Z Waiting for Dispatcher REST endpoint to come up...
2020-04-08T09:14:42.8574257Z Dispatcher REST endpoint is up.
2020-04-08T09:14:43.5236634Z Running on HA mode: parallelism=4, backend=file, asyncSnapshots=true, and incremSnapshots=false.
2020-04-08T09:14:53.2042561Z Job (8f0e52b7936bf3c3562869a1245c09a7) is running.
2020-04-08T09:14:53.2070245Z Running JM watchdog @ 39872
2020-04-08T09:14:58.2126283Z Running TM watchdog @ 40353
2020-04-08T09:14:58.2131214Z Waiting for text Completed checkpoint [1-9]* for job 8f0e52b7936bf3c3562869a1245c09a7 to appear 2 of times in logs...
2020-04-08T09:14:58.8598035Z Killed JM @ 38795
2020-04-08T09:14:58.8637495Z Waiting for text Completed checkpoint [1-9]* for job 8f0e52b7936bf3c3562869a1245c09a7 to appear 2 of times in logs...
2020-04-08T09:14:58.8950102Z grep: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT/log/*standalonesession-1*.log: No such file or directory
2020-04-08T09:14:59.8373637Z Starting standalonesession daemon on host fv-az32.
2020-04-08T09:14:59.9034988Z grep: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT/log/*standalonesession-1*.log: No such file or directory
2020-04-08T09:15:01.5369130Z grep: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT/log/*standalonesession-1*.log: No such file or directory
2020-04-08T09:15:01.7818592Z Killed TM @ 39090
2020-04-08T09:15:48.1832528Z Killed TM @ 41002
2020-04-08T09:16:00.5940266Z Killed TM @ 42747
2020-04-08T09:24:34.8032228Z Test (pid: 38219) did not finish after 600 seconds.
{code}
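
For context, the "Waiting for text ..." lines come from the test polling the logs until a pattern has appeared a given number of times. Below is a minimal Python sketch of that kind of wait loop. It is not the actual test script: the function names, the timeout, and the polling interval are hypothetical.

```python
import glob
import re
import time


def count_matches(log_glob, pattern):
    """Count lines matching `pattern` across all files matching `log_glob`."""
    regex = re.compile(pattern)
    total = 0
    for path in glob.glob(log_glob):
        with open(path, errors="replace") as log:
            total += sum(1 for line in log if regex.search(line))
    return total


def wait_for_text(log_glob, pattern, times, timeout_s=600.0, poll_s=1.0):
    """Poll until `pattern` was seen at least `times` times; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if count_matches(log_glob, pattern) >= times:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In this sketch, a glob that matches no file yet (as with `*standalonesession-1*.log` in the output above) simply counts as zero matches, and the loop keeps polling until the timeout.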

I only checked the logs of the second standalone session, and found the 
following:
- It successfully restores the last checkpoint from the first standalone 
session (checkpoint 5)
- It completes checkpoint 8 (not sure where 6 and 7 are?)
- It triggers checkpoint 9
- It has an artificial failure
- It fails trying to restore checkpoint 9 (!)(!)(!):
{code}
2020-04-08 09:15:45,639 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - SlidingWindowCheckMapper -> Sink: SlidingWindowCheckPrintSink (2/4) (27bca3ef87f434874ab87b5d7eede1b9) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:246) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:293) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:436) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:432) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:445) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:718) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:542) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_242]
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for StreamFlatMap_149799a3e2c39804818236cc493c243c_(2/4) from any of the 1 provided restore options.
        at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:304) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:131) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        ... 9 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Failed when trying to restore heap backend
        at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:116) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.runtime.state.filesystem.FsStateBackend.createKeyedStateBackend(FsStateBackend.java:529) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:288) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:304) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:131) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        ... 9 more
Caused by: java.io.FileNotFoundException: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-34578709671/checkpoints/8f0e52b7936bf3c3562869a1245c09a7/chk-9/f16bc9dd-7b4b-413a-9c8e-876d62acc46a (No such file or directory)
        at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_242]
        at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_242]
        at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_242]
        at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.runtime.state.KeyGroupsStateHandle.openInputStream(KeyGroupsStateHandle.java:112) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.runtime.state.heap.HeapRestoreOperation.restore(HeapRestoreOperation.java:125) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:114) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.runtime.state.filesystem.FsStateBackend.createKeyedStateBackend(FsStateBackend.java:529) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:288) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:304) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:131) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
        ... 9 more
{code}
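
To double-check the sequence in the list above, one could scan the standalone session log for checkpoints that were triggered but never reported as completed. The following is a rough Python sketch, not part of the test. The "Completed checkpoint N for job ..." wording is taken from the pattern the e2e test greps for; the "Triggering checkpoint N" wording and the function name `checkpoint_summary` are assumptions.

```python
import re

# Wording taken from the pattern the e2e test waits for.
COMPLETED = re.compile(r"Completed checkpoint (\d+) for job (\w+)")
# Assumed wording of the trigger message; adjust if the log differs.
TRIGGERED = re.compile(r"Triggering checkpoint (\d+)")


def checkpoint_summary(lines, job_id):
    """Return (pending, completed) checkpoint ids for `job_id`.

    `pending` holds ids that were triggered but have no matching
    "Completed checkpoint" line in the given log lines.
    """
    triggered, completed = set(), set()
    for line in lines:
        match = COMPLETED.search(line)
        if match and match.group(2) == job_id:
            completed.add(int(match.group(1)))
            continue
        match = TRIGGERED.search(line)
        if match and job_id in line:
            triggered.add(int(match.group(1)))
    return sorted(triggered - completed), sorted(completed)
```

If checkpoint 9 ended up in `pending` while the restore still reads from `chk-9`, that would be consistent with the job trying to restore a checkpoint that never completed.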

I attached the logs.



was (Author: rmetzger):
I was finally able to isolate a failure with good logs. This issue seems to be 
related to the state backends.



> test_ha_per_job_cluster_datastream.sh gets stuck
> ------------------------------------------------
>
>                 Key: FLINK-16423
>                 URL: https://issues.apache.org/jira/browse/FLINK-16423
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Tests
>    Affects Versions: 1.11.0
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>            Priority: Blocker
>
> This was seen in 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=5905&view=logs&j=b1623ac9-0979-5b0d-2e5e-1377d695c991&t=e7804547-1789-5225-2bcf-269eeaa37447
>  ... the relevant part of the logs is here:
> {code}
> 2020-03-04T11:27:25.4819486Z ==============================================================================
> 2020-03-04T11:27:25.4820470Z Running 'Running HA per-job cluster (rocks, non-incremental) end-to-end test'
> 2020-03-04T11:27:25.4820922Z ==============================================================================
> 2020-03-04T11:27:25.4840177Z TEST_DATA_DIR: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-25482960156
> 2020-03-04T11:27:25.6712478Z Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-04T11:27:25.6830402Z Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-04T11:27:26.2988914Z Starting zookeeper daemon on host fv-az655.
> 2020-03-04T11:27:26.3001237Z Running on HA mode: parallelism=4, backend=rocks, asyncSnapshots=true, and incremSnapshots=false.
> 2020-03-04T11:27:27.4206924Z Starting standalonejob daemon on host fv-az655.
> 2020-03-04T11:27:27.4217066Z Start 1 more task managers
> 2020-03-04T11:27:30.8412541Z Starting taskexecutor daemon on host fv-az655.
> 2020-03-04T11:27:38.1779980Z Job (00000000000000000000000000000000) is running.
> 2020-03-04T11:27:38.1781375Z Running JM watchdog @ 89778
> 2020-03-04T11:27:38.1781858Z Running TM watchdog @ 89779
> 2020-03-04T11:27:38.1783272Z Waiting for text Completed checkpoint [1-9]* for job 00000000000000000000000000000000 to appear 2 of times in logs...
> 2020-03-04T13:21:29.9076797Z ##[error]The operation was canceled.
> 2020-03-04T13:21:29.9094090Z ##[section]Finishing: Run e2e tests
> {code}
> The last three lines indicate that the test is waiting forever for a 
> checkpoint to appear.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
