[ 
https://issues.apache.org/jira/browse/FLINK-33483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784027#comment-17784027
 ] 

Xin Chen commented on FLINK-33483:
----------------------------------

*My biggest confusion is why Flink defines UNDEFINED at all*. In this 
scenario, UNDEFINED is entirely due to exceptions (zk or jm exceptions). Why 
not report FAILED instead? A FAILED status lets us detect the failure and 
retry the task, whereas UNDEFINED carries no useful meaning here. Is there a 
better solution to this problem in later versions of Flink, or a reliable way 
to reproduce this scenario? My proposal is the same as FLINK-12302: report a 
FAILED finalStatus to the resourcemanager in this case, which gives the user 
the clearest signal. The task may actually have run successfully in the TM 
(taskmanager), but an exception (zk disconnection) still occurred. Either 
way, running a task and ending up with an UNDEFINED state is confusing for 
the user.
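
A minimal sketch of the proposal, assuming hypothetical enum and method names 
(JobOutcome, ReportedStatus, current, proposed). None of these are Flink or 
YARN classes, and the CANCELED to KILLED mapping is an assumption added only 
to make the example complete:

```java
// Hypothetical names for illustration only; not Flink's or YARN's actual code.
public class FinalStatusSketch {

    /** Job outcome as known to the jobmanager (hypothetical). */
    enum JobOutcome { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    /** Final status reported to the resourcemanager (hypothetical). */
    enum ReportedStatus { SUCCEEDED, FAILED, KILLED, UNDEFINED }

    /** Behaviour described in this issue: an unknown outcome ends up as UNDEFINED. */
    static ReportedStatus current(JobOutcome outcome) {
        switch (outcome) {
            case SUCCEEDED:
                return ReportedStatus.SUCCEEDED;
            case FAILED:
                return ReportedStatus.FAILED;
            case CANCELED:
                return ReportedStatus.KILLED; // assumed mapping, not from the issue
            default:
                return ReportedStatus.UNDEFINED;
        }
    }

    /** Proposed behaviour: anything that would be UNDEFINED is reported as FAILED. */
    static ReportedStatus proposed(JobOutcome outcome) {
        ReportedStatus status = current(outcome);
        return status == ReportedStatus.UNDEFINED ? ReportedStatus.FAILED : status;
    }

    public static void main(String[] args) {
        // Print both mappings side by side for every outcome.
        for (JobOutcome outcome : JobOutcome.values()) {
            System.out.println(outcome + ": " + current(outcome) + " -> " + proposed(outcome));
        }
    }
}
```

With such a mapping, a caller that only checks for FAILED could decide to 
retry without handling UNDEFINED as a separate case.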

> Why is “UNDEFINED” defined in the Flink task status?
> ----------------------------------------------------
>
>                 Key: FLINK-33483
>                 URL: https://issues.apache.org/jira/browse/FLINK-33483
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / RPC, Runtime / Task
>    Affects Versions: 1.12.2
>            Reporter: Xin Chen
>            Priority: Major
>         Attachments: container_e15_1693914709123_8498_01_000001_8042
>
>
> In Flink on Yarn mode, if an unknown status appears in the Flink log, the 
> jm (jobmanager) reports the task status as undefined. The Yarn page then 
> displays the state as FINISHED, but the final status as *UNDEFINED*. From a 
> business perspective, it is unknown whether the task failed or succeeded, 
> and whether it should be retried, so this has a real impact. Why was 
> UNDEFINED designed this way? This situation usually occurs because of a zk 
> (zookeeper) disconnection, a jm abnormality, or similar. Since something 
> abnormal has happened, why not use FAILED?
>  


