Nico Kruber created FLINK-25027:
-----------------------------------

             Summary: Allow GC of a finished job's JobMaster before the slot 
timeout is reached
                 Key: FLINK-25027
                 URL: https://issues.apache.org/jira/browse/FLINK-25027
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.13.3, 1.12.5, 1.14.0
            Reporter: Nico Kruber
         Attachments: image-2021-11-23-20-32-20-479.png

In a session cluster, after a (batch) job is finished, the JobMaster seems to 
stick around for another couple of minutes before being eligible for garbage 
collection.

Looking into a heap dump, the JobMaster seems to be retained by a 
{{PhysicalSlotRequestBulkCheckerImpl}} that is enqueued in the underlying Akka 
executor (and keeps the JM from being GC’d). By default, the action is 
scheduled after {{slot.request.timeout}}, which defaults to 5 min (thanks 
[~trohrmann] for helping out here).

!image-2021-11-23-20-32-20-479.png!

With this setting, you have to provision enough metaspace to cover 5 minutes' 
worth of finished jobs, which may be several jobs, needlessly!


The problem seems to be that Flink schedules this action on the main thread 
executor, which uses the {{ActorSystem}}'s scheduler, and a future task 
scheduled with Akka can (probably) not be cancelled easily.
One idea could be to use a dedicated thread pool per JM, that we shut down when 
the JM terminates. That way we would not keep the JM from being GC’d.
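
A minimal sketch of that idea, using only JDK classes: each job owns its own 
scheduler for delayed checks and shuts it down on termination, which drops the 
pending tasks (and with them the references back to the job). The class and 
method names ({{PerJobScheduler}}, {{scheduleTimeoutCheck}}, 
{{shutdownAndCountDropped}}) are hypothetical and not part of Flink; how this 
would be wired into the actual JobMaster is left open.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a per-JobMaster timer pool (not a Flink class). */
final class PerJobScheduler {

    // One dedicated daemon thread per job, instead of a shared scheduler.
    private final ScheduledExecutorService timers =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "per-job-timer");
                thread.setDaemon(true);
                return thread;
            });

    /** Schedules a delayed action, e.g. a slot request timeout check. */
    void scheduleTimeoutCheck(Runnable check, long delayMillis) {
        timers.schedule(check, delayMillis, TimeUnit.MILLISECONDS);
    }

    /**
     * Called when the job terminates. shutdownNow() removes all tasks that
     * have not started yet and returns them, so the pool no longer holds
     * references to anything those tasks captured.
     */
    int shutdownAndCountDropped() {
        List<Runnable> dropped = timers.shutdownNow();
        return dropped.size();
    }

    boolean isShutdown() {
        return timers.isShutdown();
    }
}
```

With this shape, a check scheduled 5 minutes ahead is discarded as soon as the 
job finishes, rather than staying enqueued until its delay expires.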


(The concrete example we investigated was a DataSet job)







--
This message was sent by Atlassian Jira
(v8.20.1#820001)
