[ 
https://issues.apache.org/jira/browse/FLINK-25027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450246#comment-17450246
 ] 
Yangze Guo commented on FLINK-25027:
------------------------------------

Hi [~zjureel], thanks for your proposal. I think this is an important 
improvement for users who run batch jobs with Flink. I'd like to share my 
two cents on it:
- Instead of just adding a thread pool to the JM, how about generalizing this 
mechanism in `RpcEndpoint`? That way, other components like TM and RM could 
also use it to schedule their periodic tasks.
- Your proposal mitigates the waste of the JM's metaspace. However, full GCs 
caused by those periodic tasks can also hurt performance in the batch 
scenario, because the tasks are likely to be promoted to the old generation 
before they are executed. I think we'd better have a unified solution for 
periodic tasks that also mitigates such promotion, if possible.
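
To make the first point concrete, here is a minimal sketch of a scheduler that 
is owned by a single endpoint and releases its pending tasks when the endpoint 
stops. It is plain JDK code, not Flink's API: `EndpointScopedScheduler` and 
`pendingTasks()` are hypothetical names, and how this would be wired into 
`RpcEndpoint` is left open. It only addresses reachability of pending tasks, 
not the old-generation promotion mentioned in the second point.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a Flink class): a scheduler owned by one endpoint. */
final class EndpointScopedScheduler implements AutoCloseable {

    private final ScheduledThreadPoolExecutor executor;

    EndpointScopedScheduler(String endpointName) {
        executor = new ScheduledThreadPoolExecutor(1, runnable -> {
            Thread thread = new Thread(runnable, endpointName + "-scheduler");
            thread.setDaemon(true);
            return thread;
        });
        // Remove cancelled tasks from the queue right away, so that they (and
        // whatever they capture) do not stay reachable until the delay expires.
        executor.setRemoveOnCancelPolicy(true);
    }

    /** Schedules a one-shot task; the caller may cancel the returned future. */
    ScheduledFuture<?> schedule(Runnable task, long delay, TimeUnit unit) {
        return executor.schedule(task, delay, unit);
    }

    /** Number of tasks still waiting in the queue (for illustration only). */
    int pendingTasks() {
        return executor.getQueue().size();
    }

    /** Called when the owning endpoint terminates; drops all waiting tasks. */
    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

Under this assumption, an endpoint would create one such scheduler when it 
starts and close it when it stops, so a timeout check scheduled minutes ahead 
no longer keeps the finished endpoint reachable.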

> Allow GC of a finished job's JobMaster before the slot timeout is reached
> -------------------------------------------------------------------------
>
>                 Key: FLINK-25027
>                 URL: https://issues.apache.org/jira/browse/FLINK-25027
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.12.5, 1.13.3
>            Reporter: Nico Kruber
>            Assignee: Shammon
>            Priority: Major
>             Fix For: 1.15.0, 1.14.1, 1.13.4
>
>         Attachments: image-2021-11-23-20-32-20-479.png
>
>
> In a session cluster, after a (batch) job is finished, the JobMaster seems to 
> stick around for another couple of minutes before being eligible for garbage 
> collection.
> Looking into a heap dump, it seems to be tied to a 
> {{PhysicalSlotRequestBulkCheckerImpl}} which is enqueued in the underlying 
> Akka executor (and keeps the JM from being GC’d). By default, the action is 
> scheduled for {{slot.request.timeout}}, which defaults to 5 min (thanks 
> [~trohrmann] for helping out here).
> !image-2021-11-23-20-32-20-479.png!
> With this setting, you will have to account for enough metaspace to cover 5 
> minutes of time which may span a couple of jobs, needlessly!
> The problem seems to be that Flink schedules this action via the main thread 
> executor, which uses the {{ActorSystem}}'s scheduler, and a future task 
> scheduled with Akka can (probably) not be cancelled easily.
> One idea could be to use a dedicated thread pool per JM, that we shut down 
> when the JM terminates. That way we would not keep the JM from being GC’d.
> (The concrete example we investigated was a DataSet job)



