[ 
https://issues.apache.org/jira/browse/FLINK-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901363#comment-16901363
 ] 

Yun Tang commented on FLINK-13601:
----------------------------------

[~aljoscha] After looking up the [detail 
log|https://transfer.sh/bNhvk/1452.8.tar.gz] of 
[flink-ci-failure|https://travis-ci.com/flink-ci/flink/jobs/222711830]. I am 
pretty sure that this is a test instability.

Currently, I use {{notifyCheckpointComplete}} interface to record 
{{snapshotIndicesOfSubTask}} and {{lastCompletedCheckpointId}} on task side. 
However, the checkpoint 15 completed just after this task failed, and 
{{notifyCheckpointComplete}} would not invoked on task side anymore since this 
the task has been failded. In other words, {{lastCompletedCheckpointId}} would 
only be record as last complete checkpoint 14, which leads to the failure.
 Below is related code:
{code:java}
@Override
public void notifyCheckpointComplete(long checkpointId) {
        if (getRuntimeContext().getIndexOfThisSubtask() == NUM_OF_REGIONS - 1) {
                lastCompletedCheckpointId.set(checkpointId);
                snapshotIndicesOfSubTask.put(checkpointId, lastRegionIndex);
                numCompletedCheckpoints.incrementAndGet();
        }
}
{code}
And below is the logs which record this situation.
{code:java}
10:44:38,755 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    
 - Triggering checkpoint 15 @ 1565088278754 for job 
9e0fbeaa580123e05cfce5554f443d23.
10:44:38,762 INFO  org.apache.flink.runtime.taskmanager.Task                    
 - Map (3/3) (fb073c45e552de69f06e8037f581491c) switched from RUNNING to FAILED.
org.apache.flink.test.checkpointing.RegionFailoverITCase$TestException
        at 
org.apache.flink.test.checkpointing.RegionFailoverITCase$FailingMapperFunction.map(RegionFailoverITCase.java:344)
        at 
org.apache.flink.test.checkpointing.RegionFailoverITCase$FailingMapperFunction.map(RegionFailoverITCase.java:315)
        at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
        at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processElement(StreamOneInputProcessor.java:164)
        at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:143)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.performDefaultAction(StreamTask.java:269)
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.execution.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:140)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:378)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:688)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:518)
        at java.lang.Thread.run(Thread.java:748)
10:44:38,776 INFO  
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
  - Calculating tasks to restart to recover the failed task 
5c51e52cde5a1c4df827ddb38fbc8da9_2.
10:44:38,776 INFO  
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
  - 3 tasks should be restarted to recover the failed task 
5c51e52cde5a1c4df827ddb38fbc8da9_2. 
10:44:38,776 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       
 - Source: Custom Source (3/3) (1a7ad9441ead7561b1a92bef8d46e421) switched from 
RUNNING to CANCELING.
10:44:38,776 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       
 - Sink: Unnamed (3/3) (6cdcfb1522f95345e0afbd32e82aaa68) switched from RUNNING 
to CANCELING.
10:44:38,777 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       
 - Discarding the results produced by task execution 
fb073c45e552de69f06e8037f581491c.
10:44:38,777 INFO  org.apache.flink.runtime.taskmanager.Task                    
 - Attempting to cancel task Source: Custom Source (3/3) 
(1a7ad9441ead7561b1a92bef8d46e421).
10:44:38,780 INFO  org.apache.flink.runtime.taskmanager.Task                    
 - Source: Custom Source (3/3) (1a7ad9441ead7561b1a92bef8d46e421) switched from 
RUNNING to CANCELING.
10:44:38,780 INFO  org.apache.flink.runtime.taskmanager.Task                    
 - Triggering cancellation of task code Source: Custom Source (3/3) 
(1a7ad9441ead7561b1a92bef8d46e421).
10:44:38,780 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    
 - Completed checkpoint 15 for job 9e0fbeaa580123e05cfce5554f443d23 (35251 
bytes in 26 ms).
{code}
I plan to use a customized {{CompletedCheckpointStore}} to record information 
instead of relaying on the {{notifyCheckpointComplete}} invoked on task side to 
fix this problem. Pleas assign this ticket to me.

> RegionFailoverITCase is unstable
> --------------------------------
>
>                 Key: FLINK-13601
>                 URL: https://issues.apache.org/jira/browse/FLINK-13601
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Aljoscha Krettek
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.9.0
>
>
> Excerpt from https://travis-ci.com/flink-ci/flink/jobs/222711830:
> {code}
> 10:44:31.222 [INFO] Running 
> org.apache.flink.test.checkpointing.RegionFailoverITCase
> org.apache.flink.client.program.ProgramInvocationException: Job failed 
> (JobID: 9e0fbeaa580123e05cfce5554f443d23)
>     at 
> org.apache.flink.client.program.MiniClusterClient.submitJob(MiniClusterClient.java:92)
>     at 
> org.apache.flink.test.checkpointing.RegionFailoverITCase.testMultiRegionFailover(RegionFailoverITCase.java:132)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>     at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>     at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>     at 
> org.apache.flink.client.program.MiniClusterClient.submitJob(MiniClusterClient.java:90)
>     ... 13 more
> Caused by: java.lang.RuntimeException: Test failed due to unexpected 
> recovered index: 2000, while last completed checkpoint record index: 1837
>     at 
> org.apache.flink.test.checkpointing.RegionFailoverITCase$StringGeneratingSourceFunction.initializeState(RegionFailoverITCase.java:300)
>     at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
>     at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
>     at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
>     at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:862)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:367)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:688)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:518)
>     ... 1 more
> 10:44:39.210 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time 
> elapsed: 7.983 s <<< FAILURE! - in 
> org.apache.flink.test.checkpointing.RegionFailoverITCase
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to