[ 
https://issues.apache.org/jira/browse/FLINK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

john updated FLINK-25832:
-------------------------
    Description: 
I deployed a standalone Flink cluster on Kubernetes and enabled 
scheduler-mode=reactive. When a TaskManager shuts down, I actively call the 
ResourceManager's closeTaskManagerConnection method. However, when the 
AdaptiveScheduler then starts restarting the job, it calls Execution's cancel 
method, and this method does not check whether the status of its associated 
slot is still Alive. Since the TaskManager that owns the slot has already been 
closed, the cancel RPC triggers an RpcTimeout.

I then changed Execution's cancel method to check whether the slot status is 
Alive before cancelling, but repeating the scenario above still triggers the 
RpcTimeout. The problem, as I see it, is that actively calling the 
ResourceManager's closeTaskManagerConnection method does not change the state 
of the slots allocated on that TaskManager, so the check still sees them as 
Alive. I think this is a bug. We should also optimize the behavior of cancel 
so that cancellation completes faster instead of waiting for the RPC timeout.
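
For illustration, here is a minimal, self-contained sketch of the kind of guard I experimented with. The Slot interface and the isAlive()/cancelTask() names below are hypothetical stand-ins rather than the actual Flink runtime API; the point is only to show how cancel could skip the RPC when the slot's TaskManager is already gone.

{code:java}
import java.util.concurrent.CompletableFuture;

// Sketch only: Slot, isAlive() and cancelTask() are hypothetical stand-ins,
// not the real Flink runtime API. It demonstrates the guard I tried: finish
// cancellation locally when the slot's TaskManager is already gone, instead
// of sending an RPC that can only end in an RpcTimeout.
public class CancelGuardSketch {

    interface Slot {
        boolean isAlive();                    // false once the TaskManager is closed
        CompletableFuture<Void> cancelTask(); // RPC to the TaskManager, may time out
    }

    static CompletableFuture<Void> cancel(Slot slot) {
        if (slot == null || !slot.isAlive()) {
            // TaskManager already disconnected: complete the cancellation
            // immediately instead of waiting for the RPC timeout.
            return CompletableFuture.completedFuture(null);
        }
        return slot.cancelTask();
    }

    public static void main(String[] args) {
        Slot deadSlot = new Slot() {
            public boolean isAlive() { return false; }
            public CompletableFuture<Void> cancelTask() {
                throw new IllegalStateException("would hit RpcTimeout");
            }
        };
        // Completes immediately instead of timing out.
        System.out.println(cancel(deadSlot).isDone()); // prints: true
    }
}
{code}

In my test, however, this guard alone did not help, because the slot still reports as Alive after closeTaskManagerConnection; that is why I think the slot state itself needs to be updated when the TaskManager connection is closed.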

!image-2022-01-27-10-55-14-758.png!!image-2022-01-27-10-55-59-119.png!

!image-2022-01-27-10-57-26-223.png!


> When the TaskManager is closed, its associated slot is not set to the 
> released state.
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-25832
>                 URL: https://issues.apache.org/jira/browse/FLINK-25832
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.14.2, 1.14.3
>            Reporter: john
>            Priority: Major
>         Attachments: image-2022-01-27-10-55-14-758.png, 
> image-2022-01-27-10-55-59-119.png, image-2022-01-27-10-57-26-223.png
>
>
> I deployed a standalone Flink cluster on Kubernetes and enabled 
> scheduler-mode=reactive. When a TaskManager shuts down, I actively call the 
> ResourceManager's closeTaskManagerConnection method. However, when the 
> AdaptiveScheduler then starts restarting the job, it calls Execution's cancel 
> method, and this method does not check whether the status of its associated 
> slot is still Alive. Since the TaskManager that owns the slot has already 
> been closed, the cancel RPC triggers an RpcTimeout.
> I then changed Execution's cancel method to check whether the slot status is 
> Alive before cancelling, but repeating the scenario above still triggers the 
> RpcTimeout. The problem, as I see it, is that actively calling the 
> ResourceManager's closeTaskManagerConnection method does not change the 
> state of the slots allocated on that TaskManager, so the check still sees 
> them as Alive. I think this is a bug. We should also optimize the behavior 
> of cancel so that cancellation completes faster instead of waiting for the 
> RPC timeout.
> !image-2022-01-27-10-55-14-758.png!!image-2022-01-27-10-55-59-119.png!
> !image-2022-01-27-10-57-26-223.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
