[ https://issues.apache.org/jira/browse/FLINK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Panagiotis Garefalakis updated FLINK-33121:
-------------------------------------------
    Description: 
We make the assumption that Global Failures (with null Task name) may only be RootExceptions, and that Local/Task exceptions may be part of the concurrent exceptions list (see {{JobExceptionsHandler#createRootExceptionInfo}}).
However, when the Adaptive scheduler is in a Restarting phase due to an existing failure (that is now the new Root), we can still, on rare occasions, capture new Global failures, violating this condition (an assertion is thrown as part of {{assertLocalExceptionInfo}}), seeing something like:
{code:java}
The taskName must not be null for a non-global failure.
{code}
A solution to this could be to ignore Global failures while in a Restarting phase on the Adaptive scheduler.
This PR also fixes a smaller bug where we don't pass the [taskName|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104] properly.

Note: DefaultScheduler does not suffer from this issue as it treats failures directly as HistoryEntries (no conversion step).

  was:
We make the assumption that Global Failures (with null Task name) may only be RootExceptions, and that Local/Task exceptions may be part of the concurrent exceptions list (see {{JobExceptionsHandler#createRootExceptionInfo}}) -- However, when the Adaptive scheduler is in a Restarting phase due to an existing failure (that is now the new Root), we can still, on rare occasions, capture new Global failures, violating this condition (an assertion is thrown as part of {{assertLocalExceptionInfo}}).
A solution to this could be to ignore Global failures while in a Restarting phase on the Adaptive scheduler.
This PR also fixes a smaller bug where we don't pass the [taskName|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104] properly.
Note: DefaultScheduler does not suffer from this issue as it treats failures directly as HistoryEntries (no conversion step).

> Failed precondition in JobExceptionsHandler due to concurrent global failures
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-33121
>                 URL: https://issues.apache.org/jira/browse/FLINK-33121
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
>
> We make the assumption that Global Failures (with null Task name) may only be
> RootExceptions, and that Local/Task exceptions may be part of the concurrent
> exceptions list (see {{JobExceptionsHandler#createRootExceptionInfo}}).
> However, when the Adaptive scheduler is in a Restarting phase due to an
> existing failure (that is now the new Root), we can still, on rare occasions,
> capture new Global failures, violating this condition (an assertion is
> thrown as part of {{assertLocalExceptionInfo}}), seeing something like:
> {code:java}
> The taskName must not be null for a non-global failure.
> {code}
> A solution to this could be to ignore Global failures while in a
> Restarting phase on the Adaptive scheduler.
> This PR also fixes a smaller bug where we don't pass the
> [taskName|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]
> properly.
> Note: DefaultScheduler does not suffer from this issue as it treats failures
> directly as HistoryEntries (no conversion step).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
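
For readers unfamiliar with the invariant described in the issue, the following is a minimal, self-contained sketch of the idea, not Flink's actual code. It assumes a simplified model in which a failure entry with a null task name is global, the first failure becomes the root, and later failures captured while restarting go into a concurrent list that must contain only local entries. The classes `GlobalFailureSketch`, `FailureEntry`, `SchedulerSketch` and the `Phase` enum are hypothetical; only the name `assertLocalExceptionInfo` and the error message are taken from the issue text, and the method body here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the behaviour discussed in the issue.
class GlobalFailureSketch {

    /** Stand-in for an exception history entry; a null taskName marks a global failure. */
    static final class FailureEntry {
        final Throwable cause;
        final String taskName;

        FailureEntry(Throwable cause, String taskName) {
            this.cause = cause;
            this.taskName = taskName;
        }

        boolean isGlobal() {
            return taskName == null;
        }
    }

    /** Assumed shape of the precondition: concurrent entries must be local (non-null task name). */
    static void assertLocalExceptionInfo(FailureEntry entry) {
        if (entry.isGlobal()) {
            throw new IllegalArgumentException(
                    "The taskName must not be null for a non-global failure.");
        }
    }

    enum Phase { EXECUTING, RESTARTING }

    /** Hypothetical scheduler that drops global failures reported while it is restarting. */
    static final class SchedulerSketch {
        Phase phase = Phase.EXECUTING;
        FailureEntry root;
        final List<FailureEntry> concurrent = new ArrayList<>();

        void onFailure(Throwable cause, String taskName) {
            FailureEntry entry = new FailureEntry(cause, taskName);
            if (phase == Phase.EXECUTING) {
                // The first failure becomes the root and triggers the restart.
                root = entry;
                phase = Phase.RESTARTING;
                return;
            }
            if (entry.isGlobal()) {
                // Proposed fix: ignore global failures while already restarting,
                // so they never end up in the concurrent list.
                return;
            }
            concurrent.add(entry);
        }
    }

    public static void main(String[] args) {
        SchedulerSketch scheduler = new SchedulerSketch();
        scheduler.onFailure(new RuntimeException("root failure"), null);
        scheduler.onFailure(new RuntimeException("late global failure"), null);
        scheduler.onFailure(new RuntimeException("task failure"), "Map (1/2)");

        // Every concurrent entry passes the precondition because the late global one was dropped.
        for (FailureEntry entry : scheduler.concurrent) {
            assertLocalExceptionInfo(entry);
        }
        System.out.println("concurrent entries: " + scheduler.concurrent.size());
    }
}
```

Without the `isGlobal()` check in `onFailure`, the second failure would be added to the concurrent list and the precondition would throw when the history is converted, which is the symptom reported above.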