[ https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518243#comment-17518243 ]
Congxian Qiu commented on FLINK-26932:
--------------------------------------

Hi [~huwh], could you please update the affected versions?

> TaskManager hung in cleanupAllocationBaseDirs not exit.
> -------------------------------------------------------
>
>                 Key: FLINK-26932
>                 URL: https://issues.apache.org/jira/browse/FLINK-26932
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: huweihua
>            Priority: Major
>         Attachments: 1280X1280.png,
> origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk used by the TaskManager hit a fatal error, and the TaskManager then
> hung in cleanupAllocationBaseDirs, blocking the main thread.
>
> As a result, this TaskManager did not respond to
> cancelTask/disconnectResourceManager requests.
>
> Meanwhile, the JobMaster had already marked this TaskManager as lost and
> scheduled its tasks on other TaskManagers.
>
> This can leave unexpected tasks running.
>
> The TaskManager log shows that the TM had already lost its connection to the
> ResourceManager and kept trying to register with it again.
> The RegistrationTimeout cannot take effect because the TaskManager's main
> thread is hung.
>
> I think there are two options to handle it.
> Option 1: Add a timeout to
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs. However, I am
> afraid other methods could block the main thread too.
> Option 2: Move the registrationTimeout handling to another thread. We would
> then need to deal with the resulting concurrency problem.
>
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
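
For illustration only, here is a minimal sketch of the idea behind Option 1 in the quoted description: run a potentially blocking cleanup on a separate daemon thread and give up after a timeout, so the calling thread is never blocked indefinitely. This is not Flink code. The class name CleanupTimeoutSketch and the helper runWithTimeout are hypothetical, and the Runnable only stands in for a call such as cleanupAllocationBaseDirs.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not part of Flink.
public class CleanupTimeoutSketch {

    // Runs the cleanup on a separate daemon thread and waits at most
    // timeoutMillis. Returns true only if the cleanup completed in time.
    static boolean runWithTimeout(Runnable cleanup, long timeoutMillis) {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cleanup-io");
            // Daemon thread: a cleanup stuck in disk I/O cannot keep the JVM alive.
            t.setDaemon(true);
            return t;
        });
        Future<?> future = ioExecutor.submit(cleanup);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort: interrupting may not unblock a thread stuck in I/O.
            future.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The cleanup itself threw an exception.
            return false;
        } finally {
            ioExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A cleanup that returns immediately.
        boolean fast = runWithTimeout(() -> { }, 1000);
        // A cleanup that simulates a hung disk by sleeping far past the timeout.
        boolean hung = runWithTimeout(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // Interrupted by cancel(true); exit quietly.
            }
        }, 100);
        System.out.println("fast cleanup finished: " + fast);
        System.out.println("hung cleanup finished: " + hung);
    }
}
```

As the description notes, a timeout like this covers only the one call it wraps; any other blocking method on the main thread would need the same treatment.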