Nico Kruber created FLINK-25027:
-----------------------------------

             Summary: Allow GC of a finished job's JobMaster before the slot timeout is reached
                 Key: FLINK-25027
                 URL: https://issues.apache.org/jira/browse/FLINK-25027
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.13.3, 1.12.5, 1.14.0
            Reporter: Nico Kruber
         Attachments: image-2021-11-23-20-32-20-479.png

In a session cluster, after a (batch) job has finished, its JobMaster seems to stick around for another couple of minutes before it becomes eligible for garbage collection. Looking into a heap dump, the JobMaster seems to be tied to a {{PhysicalSlotRequestBulkCheckerImpl}} that is enqueued in the underlying Akka executor, which keeps the JM from being GC’d. By default, the action is scheduled for {{slot.request.timeout}}, which defaults to 5 min (thanks [~trohrmann] for helping out here).

!image-2021-11-23-20-32-20-479.png!

With this setting, you have to provision enough metaspace to cover 5 minutes of finished jobs, which may span a couple of jobs, needlessly!

The problem seems to be that Flink schedules this action through the main thread executor, which uses the {{ActorSystem}}'s scheduler, and a future task scheduled with Akka can (probably) not be cancelled easily.

One idea could be to use a dedicated thread pool per JM that we shut down when the JM terminates. That way, pending tasks would no longer keep the JM from being GC’d.

(The concrete example we investigated was a DataSet job.)
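
To illustrate the dedicated-thread-pool idea, here is a minimal sketch in plain Java. It is not Flink code: {{PerJobScheduler}} and its methods are hypothetical names made up for this example, and it only assumes a standard {{java.util.concurrent.ScheduledThreadPoolExecutor}} could stand in for the Akka-backed scheduler. The point is that shutting the pool down drops queued tasks (and the references they hold) right away instead of after the timeout.

```java
import java.util.List;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-JobMaster scheduler; an illustration only, not a Flink class. */
final class PerJobScheduler implements AutoCloseable {

    private final ScheduledThreadPoolExecutor executor;

    PerJobScheduler() {
        executor = new ScheduledThreadPoolExecutor(1, runnable -> {
            Thread thread = new Thread(runnable, "per-job-scheduler");
            thread.setDaemon(true); // never keep the JVM alive for a pending check
            return thread;
        });
        // Remove cancelled tasks from the queue immediately instead of
        // leaving them there until their delay has elapsed.
        executor.setRemoveOnCancelPolicy(true);
    }

    /** Schedules a delayed action, e.g. a slot request timeout check. */
    ScheduledFuture<?> scheduleTimeoutCheck(Runnable check, long delay, TimeUnit unit) {
        return executor.schedule(check, delay, unit);
    }

    /** Number of delayed tasks still referenced by the executor's queue. */
    int pendingTasks() {
        return executor.getQueue().size();
    }

    /** Called when the (hypothetical) JobMaster terminates. */
    @Override
    public void close() {
        // shutdownNow() drains the queue and returns the tasks that never ran,
        // so the executor no longer references them or anything they capture.
        List<Runnable> neverRun = executor.shutdownNow();
        neverRun.clear();
    }
}
```

In this sketch, a check scheduled with a 5 minute delay stays in the queue only until {{close()}} is called or its future is cancelled. Whether Flink's main thread executor could be wired to such a pool without breaking its single-threaded guarantees is an open question that the sketch does not address.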