Re: [PR] [FLINK-33565][Scheduler] ConcurrentExceptions works with exception merging [flink]

via GitHub Wed, 17 Jan 2024 19:46:33 -0800


1996fanrui commented on code in PR #24003:
URL: https://github.com/apache/flink/pull/24003#discussion_r1456812210



##########
flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/failover/ExecutionFailureHandlerTest.java:
##########
@@ -183,6 +194,55 @@ void testNonRecoverableFailureHandlingResult() throws 
Exception {
         assertThat(executionFailureHandler.getNumberOfRestarts()).isZero();
     }
 
+    /** Test isNewAttempt of {@link FailureHandlingResult} is expected. */
+    @Test
+    void testNewAttemptAndNumberOfRestarts() throws Exception {

Review Comment:
   Good suggestion!
   
   Also, I extracted the `assertHandlerRootException` and 
`assertHandlerConcurrentException` methods, after that, the 
`testNewAttemptAndNumberOfRestarts` is totally simple.



##########
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/FixedDelayRestartBackoffTimeStrategy.java:
##########
@@ -61,8 +61,9 @@ public long getBackoffTime() {
     }
 
     @Override
-    public void notifyFailure(Throwable cause) {
+    public boolean notifyFailure(Throwable cause) {
         currentRestartAttempt++;
+        return true;

Review Comment:
   > But from a conceptual point of view: shouldn't we cover it also for the 
two other restart strategies. Or am I missing something here?
   
   In the beginning, I want to improve all restart strategies, but I meet some 
feedback to removing other restart strategies during discussion. The core 
background is this thread: 
https://lists.apache.org/thread/l7wyc7pndpsvh2h7hj3fw2td9yphrlox
   
   In brief, 3 reasons:
   
   1. The semantics of option
   
       - the failure-rate strategy's restart upper limit option is named  
`restart-strategy.failure-rate.max-failures-per-interval`
       - It's  `max-failures-per-interval` instead of 
`max-attempts-per-interval`.
       - If we improve it directly, the name and behaviour aren't matched.
   2. We recommend users use the `exponential-delay restart strategy` in the 
future, it's more powerful.
   3. After FLIP-360 discussion, I found `exponential-delay restart strategy` 
can replace other restart strategies directly if users set the 
`restart-strategy.exponential-delay.backoff-multiplier` = 1
   
   Actually, I think other restart-strategies can be deprecated in the future.



##########
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/exceptionhistory/RootExceptionHistoryEntryTest.java:
##########
@@ -81,10 +81,14 @@ void testFromFailureHandlingResultSnapshot() throws 
ExecutionException, Interrup
         final CompletableFuture<Map<String, String>> rootFailureLabels =
                 
CompletableFuture.completedFuture(Collections.singletonMap("key", "value"));
 
-        final Throwable concurrentException = new 
IllegalStateException("Expected other failure");
-        final ExecutionVertex concurrentlyFailedExecutionVertex = 
extractExecutionVertex(1);
-        final long concurrentExceptionTimestamp =
-                triggerFailure(concurrentlyFailedExecutionVertex, 
concurrentException);
+        final Throwable concurrentException1 = new 
IllegalStateException("Expected other failure1");
+        final ExecutionVertex concurrentlyFailedExecutionVertex1 = 
extractExecutionVertex(1);
+        Predicate<ExceptionHistoryEntry> exception1Predicate =
+                getExceptionHistoryEntryPredicate(
+                        concurrentException1, 
concurrentlyFailedExecutionVertex1);
+
+        final Throwable concurrentException2 = new 
IllegalStateException("Expected other failure2");
+        final ExecutionVertex concurrentlyFailedExecutionVertex2 = 
extractExecutionVertex(2);

Review Comment:
   Thanks for the explanation!
   
   > testAddConecurrentExceptions will use all code of this test 
   
   If we have 2 tests, and test1 only call `testCommon`, and test2 call 
`testCommon` and `testPart2`. It means test2 can cover test1.
   
   2 solutions:
   - keep test1 and test2
   - Only keep test2
   
   In the current scenario, is it enough for us to only keep test2? Looking 
forward to your opinion, fine with me as well. 😁



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [FLINK-33565][Scheduler] ConcurrentExceptions works with exception merging [flink]

Reply via email to