[ https://issues.apache.org/jira/browse/FLINK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Panagiotis Garefalakis updated FLINK-33121:
-------------------------------------------
    Description: 
We assume that Global Failures (with a null Task name) may only be RootExceptions, and that Local/Task exceptions may be part of the concurrent exceptions list (see {{JobExceptionsHandler#createRootExceptionInfo}}).
However, when the Adaptive scheduler is in a Restarting phase due to an existing failure (that is now the new Root), we can still, on rare occasions, capture new Global failures. This violates the assumption, and the precondition check in {{assertLocalExceptionInfo}} fails with a message like:
{code:java}
The taskName must not be null for a non-global failure.
{code}
We want to ignore Global failures while being in a Restarting phase on the Adaptive scheduler until we properly support multiple Global failures in the Exception History as part of https://issues.apache.org/jira/browse/FLINK-34922

Note: DefaultScheduler does not suffer from this issue, as it treats failures directly as HistoryEntries (no conversion step).

  was:
We assume that Global Failures (with a null Task name) may only be RootExceptions, and that Local/Task exceptions may be part of the concurrent exceptions list (see {{JobExceptionsHandler#createRootExceptionInfo}}).
However, when the Adaptive scheduler is in a Restarting phase due to an existing failure (that is now the new Root), we can still, on rare occasions, capture new Global failures. This violates the assumption, and the precondition check in {{assertLocalExceptionInfo}} fails with a message like:
{code:java}
The taskName must not be null for a non-global failure.
{code}
We want to ignore Global failures while being in a Restarting/Canceling or Failing phase on the Adaptive scheduler until we properly support multiple Global failures in the Exception History as part of https://issues.apache.org/jira/browse/FLINK-34922

Note: DefaultScheduler does not suffer from this issue, as it treats failures directly as HistoryEntries (no conversion step).

> Failed precondition in JobExceptionsHandler due to concurrent global failures
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-33121
>                 URL: https://issues.apache.org/jira/browse/FLINK-33121
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
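
For illustration only, the proposed behaviour could be sketched roughly as below. This is a simplified stand-alone model, not Flink code: {{GlobalFailureSketch}}, {{Failure}}, {{ExceptionHistory}}, {{State}} and {{report}} are hypothetical names introduced here, and only the assertion message is taken from the description above. The idea is that a global failure (null task name) arriving while the scheduler is already Restarting is dropped, so every entry in the concurrent list still passes the local-failure check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; none of these types exist in Flink.
final class GlobalFailureSketch {

    enum State { EXECUTING, RESTARTING }

    /** A failure record; a null taskName marks a global failure. */
    static final class Failure {
        final String taskName;
        final Throwable cause;

        Failure(String taskName, Throwable cause) {
            this.taskName = taskName;
            this.cause = cause;
        }

        boolean isGlobal() {
            return taskName == null;
        }
    }

    /** Tracks one root failure plus the failures captured while restarting. */
    static final class ExceptionHistory {
        State state = State.EXECUTING;
        Failure root;
        final List<Failure> concurrent = new ArrayList<>();

        void report(Failure failure) {
            if (state == State.RESTARTING) {
                if (failure.isGlobal()) {
                    // Assumed fix: ignore global failures during a restart,
                    // since the history has no slot for a second global one.
                    return;
                }
                concurrent.add(failure);
                return;
            }
            // First failure while executing becomes the root and triggers a restart.
            root = failure;
            concurrent.clear();
            state = State.RESTARTING;
        }
    }

    /** Mirrors the precondition quoted in the description. */
    static void assertLocal(Failure failure) {
        if (failure.taskName == null) {
            throw new IllegalArgumentException(
                    "The taskName must not be null for a non-global failure.");
        }
    }

    public static void main(String[] args) {
        ExceptionHistory history = new ExceptionHistory();
        history.report(new Failure(null, new RuntimeException("first global failure")));
        history.report(new Failure(null, new RuntimeException("second global failure")));
        history.report(new Failure("Map (1/2)", new RuntimeException("task failure")));

        // Every concurrent entry must be local, so this loop should not throw.
        for (Failure f : history.concurrent) {
            assertLocal(f);
        }
        System.out.println("concurrent failures recorded: " + history.concurrent.size());
    }
}
```

Without the early return in {{report}}, the second global failure would land in the concurrent list and {{assertLocal}} would throw, which is the symptom described in this issue.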