[ https://issues.apache.org/jira/browse/FLINK-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923162#comment-16923162 ]
Till Rohrmann commented on FLINK-13962:
---------------------------------------

The improvement proposal sounds good to me [~zhuzh]. Do you have time to work on it?

> Task state handles leak if the task fails before deploying
> ----------------------------------------------------------
>
>                 Key: FLINK-13962
>                 URL: https://issues.apache.org/jira/browse/FLINK-13962
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>
> Currently, the _taskRestore_ field of an _Execution_ is reset to null in the
> task deployment stage.
> The purpose of this is that it "allows the JobManagerTaskRestore instance to be garbage
> collected. Furthermore, it won't be archived along with the Execution in the
> ExecutionVertex in case of a restart. This is especially important when
> setting state.backend.fs.memory-threshold to larger values because every
> state below this threshold will be stored in the meta state files and, thus,
> also the JobManagerTaskRestore instances." (From FLINK-9693)
>
> However, if a task fails before it reaches the deployment stage (e.g. due to
> a slot allocation timeout), the _taskRestore_ field remains non-null and is
> archived along with the prior executions.
> This may result in high JM heap usage in certain cases and lead to continuous
> JM full GCs.
>
> I propose setting the _taskRestore_ field to null before moving an
> _Execution_ to prior executions.
> We can keep the logic that sets the _taskRestore_ field to null after task
> deployment, which allows it to be GC'ed earlier in the normal case.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
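
For illustration, the fix proposed in the issue description could look roughly like the sketch below. It is not Flink's actual code: `RestoreSketch`, `ExecutionSketch`, `ExecutionVertexSketch`, and all of their methods are hypothetical stand-ins for `JobManagerTaskRestore`, `Execution`, and `ExecutionVertex`, and the only behavior modeled is the lifetime of the `taskRestore` reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for JobManagerTaskRestore: may hold large inlined state.
final class RestoreSketch {
    final byte[] inlinedState;

    RestoreSketch(int sizeInBytes) {
        this.inlinedState = new byte[sizeInBytes];
    }
}

// Hypothetical stand-in for Execution; only the taskRestore lifetime is modeled.
final class ExecutionSketch {
    private RestoreSketch taskRestore;

    ExecutionSketch(RestoreSketch taskRestore) {
        this.taskRestore = taskRestore;
    }

    // Existing behavior described in the issue: cleared once the task is deployed.
    void deploy() {
        taskRestore = null;
    }

    // Proposed addition: allow clearing the reference before archiving.
    void releaseTaskRestore() {
        taskRestore = null;
    }

    RestoreSketch getTaskRestore() {
        return taskRestore;
    }
}

// Hypothetical stand-in for ExecutionVertex, which archives prior executions.
final class ExecutionVertexSketch {
    private final List<ExecutionSketch> priorExecutions = new ArrayList<>();
    private ExecutionSketch current;

    ExecutionVertexSketch(RestoreSketch restore) {
        this.current = new ExecutionSketch(restore);
    }

    ExecutionSketch current() {
        return current;
    }

    List<ExecutionSketch> priorExecutions() {
        return priorExecutions;
    }

    void resetForNewExecution(RestoreSketch restore) {
        // Proposed fix: drop the restore reference before the execution is
        // archived, so a failure before deployment cannot leak it.
        current.releaseTaskRestore();
        priorExecutions.add(current);
        current = new ExecutionSketch(restore);
    }
}

final class TaskRestoreLeakSketch {
    public static void main(String[] args) {
        ExecutionVertexSketch vertex = new ExecutionVertexSketch(new RestoreSketch(1024));

        // Simulate a failure before deployment: deploy() is never called.
        vertex.resetForNewExecution(null);

        boolean leaked = vertex.priorExecutions().get(0).getTaskRestore() != null;
        System.out.println("archived execution still holds restore: " + leaked);
    }
}
```

Clearing the reference in both `deploy()` and `resetForNewExecution()` mirrors the suggestion to keep the early release after deployment while also covering executions that never reach that stage.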