[ 
https://issues.apache.org/jira/browse/FLINK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panagiotis Garefalakis updated FLINK-33121:
-------------------------------------------
    Description: 
We make the assumption that Global Failures (with a null Task name) may only be 
RootExceptions, and that Local/Task exceptions may be part of the concurrent 
exceptions list (see {{JobExceptionsHandler#createRootExceptionInfo}}).
However, when the Adaptive scheduler is in a Restarting phase due to an 
existing failure (that is now the new Root), we can still, on rare occasions, 
capture new Global failures. This violates the assumption: an assertion is 
thrown as part of {{assertLocalExceptionInfo}}, with a message like:
{code:java}
The taskName must not be null for a non-global failure.  {code}
We want the Adaptive scheduler to ignore Global failures while it is in a 
Restarting phase, until we properly support multiple Global failures in the 
Exception History as part of https://issues.apache.org/jira/browse/FLINK-34922

Note: DefaultScheduler does not suffer from this issue as it treats failures 
directly as HistoryEntries (no conversion step)
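
A minimal sketch of the proposed behaviour, assuming a heavily simplified 
model. All names here (GlobalFailureSketch, Phase, Failure, HistoryEntry, 
onFailure, assertLocal) are hypothetical and are not Flink's actual classes; 
only the precondition message is taken from the description above. While 
restarting, a new Global failure is dropped instead of being recorded as a 
concurrent exception, so the "concurrent exceptions are local" assumption 
holds:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the behaviour described above; not Flink code. */
final class GlobalFailureSketch {

    /** Simplified scheduler phases (assumed names). */
    enum Phase { EXECUTING, RESTARTING }

    /** A failure; a null taskName marks a Global failure, as in the description. */
    static final class Failure {
        final String message;
        final String taskName;

        Failure(String message, String taskName) {
            this.message = message;
            this.taskName = taskName;
        }

        boolean isGlobal() {
            return taskName == null;
        }
    }

    /** One exception-history entry: a root failure plus concurrent local failures. */
    static final class HistoryEntry {
        final Failure root;
        final List<Failure> concurrent = new ArrayList<>();

        HistoryEntry(Failure root) {
            this.root = root;
        }
    }

    private final List<HistoryEntry> history = new ArrayList<>();
    private Phase phase = Phase.EXECUTING;
    private int ignoredGlobalFailures = 0;

    /** Stand-in for the precondition that fails in the reported bug. */
    static void assertLocal(Failure failure) {
        if (failure.isGlobal()) {
            throw new IllegalArgumentException(
                    "The taskName must not be null for a non-global failure.");
        }
    }

    void onFailure(Failure failure) {
        if (phase == Phase.RESTARTING) {
            if (failure.isGlobal()) {
                // Proposed behaviour: ignore Global failures while restarting.
                ignoredGlobalFailures++;
                return;
            }
            // Local failures join the concurrent list of the current root.
            assertLocal(failure);
            history.get(history.size() - 1).concurrent.add(failure);
            return;
        }
        // The first failure becomes the root and triggers the restart.
        history.add(new HistoryEntry(failure));
        phase = Phase.RESTARTING;
    }

    void onRestarted() {
        phase = Phase.EXECUTING;
    }

    List<HistoryEntry> history() {
        return history;
    }

    int ignoredGlobalFailures() {
        return ignoredGlobalFailures;
    }

    public static void main(String[] args) {
        GlobalFailureSketch scheduler = new GlobalFailureSketch();
        scheduler.onFailure(new Failure("root global failure", null));
        scheduler.onFailure(new Failure("second global failure", null));
        scheduler.onFailure(new Failure("task failure", "task-1"));
        System.out.println("roots=" + scheduler.history().size()
                + " ignored=" + scheduler.ignoredGlobalFailures());
    }
}
```

Without the isGlobal() check in the RESTARTING branch, the second Global 
failure would reach assertLocal and fail with the message quoted above.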

  was:
We make the assumption that Global Failures (with a null Task name) may only be 
RootExceptions, and that Local/Task exceptions may be part of the concurrent 
exceptions list (see {{JobExceptionsHandler#createRootExceptionInfo}}).
However, when the Adaptive scheduler is in a Restarting phase due to an 
existing failure (that is now the new Root), we can still, on rare occasions, 
capture new Global failures. This violates the assumption: an assertion is 
thrown as part of {{assertLocalExceptionInfo}}, with a message like:
{code:java}
The taskName must not be null for a non-global failure.  {code}
We want the Adaptive scheduler to ignore Global failures while it is in a 
Restarting, Canceling, or Failing phase, until we properly support multiple 
Global failures in the Exception History as part of 
https://issues.apache.org/jira/browse/FLINK-34922

Note: DefaultScheduler does not suffer from this issue as it treats failures 
directly as HistoryEntries (no conversion step)


> Failed precondition in JobExceptionsHandler due to concurrent global failures
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-33121
>                 URL: https://issues.apache.org/jira/browse/FLINK-33121
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available


