[jira] [Updated] (FLINK-31144) Slow scheduling on large-scale batch jobs

Julien Tournay (Jira) Mon, 20 Feb 2023 14:52:05 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-31144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Julien Tournay updated FLINK-31144:
-----------------------------------
    Description: 
When executing a complex job graph at high parallelism 
`DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can get 
slow and cause long pauses where the JobManager becomes unresponsive and all 
the taskmanagers just wait. I've attached a VisualVM snapshot to illustrate the 
problem.[^flink-1.17-snapshot-1676473798013.nps]

At Spotify we have complex jobs where this issue can cause batch "pause" of 40+ 
minutes and make the overall execution 30% slower or more.
More importantly this prevent us from running said jobs on larger cluster as 
adding resources to the cluster worsen the issue.

We have successfully tested a modified Flink version where 
`DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` was 
completely commented and simply returns an empty collection and confirmed it 
solves the issue.

In the same spirit as a recent change 
([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102)]
 there could be a mechanism in place to detect when Flink run into this 
specific issue and just skip the call to `getInputLocationFutures`  
[https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108.]

I'm not familiar enough with the internals of Flink to propose a more advanced 
fix, however it seems like a configurable threshold on the number of consumer 
vertices above which the preferred location is not computed would do. If this  
solution is good enough, I'd be happy to submit a PR.

  was:
When executing a complex job graph at high parallelism 
`DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can get 
slow and cause long pauses where the JobManager becomes unresponsive and all 
the taskmanagers just wait. I've attached a VisualVM snapshot to illustrate the 
problem.[^flink-1.17-snapshot-1676473798013.nps]

At Spotify we have complex jobs where this issue can cause batch "pause" of 40+ 
minutes and make the overall execution 30% slower or more.
More importantly this prevent us from running said jobs on larger cluster as 
adding resources to the cluster worsen the issue.

We have successfully tested a test where 
`DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` was 
completely commented and simply returns an empty collection and confirmed it 
solves the issue.

In the same spirit as a recent change 
([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102)]
 there could be a mechanism in place to detect when Flink run into this 
specific issue and just skip the call to `getInputLocationFutures`  
[https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108.]

I'm not familiar enough with the internals of Flink to propose a more advanced 
fix, however it seems like a configurable threshold on the number of consumer 
vertices above which the preferred location is not computed would do. If this  
solution is good enough, I'd be happy to submit a PR.


> Slow scheduling on large-scale batch jobs 
> ------------------------------------------
>
>                 Key: FLINK-31144
>                 URL: https://issues.apache.org/jira/browse/FLINK-31144
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Julien Tournay
>            Priority: Major
>         Attachments: flink-1.17-snapshot-1676473798013.nps
>
>
> When executing a complex job graph at high parallelism 
> `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can 
> get slow and cause long pauses where the JobManager becomes unresponsive and 
> all the taskmanagers just wait. I've attached a VisualVM snapshot to 
> illustrate the problem.[^flink-1.17-snapshot-1676473798013.nps]
> At Spotify we have complex jobs where this issue can cause batch "pause" of 
> 40+ minutes and make the overall execution 30% slower or more.
> More importantly this prevent us from running said jobs on larger cluster as 
> adding resources to the cluster worsen the issue.
> We have successfully tested a modified Flink version where 
> `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` was 
> completely commented and simply returns an empty collection and confirmed it 
> solves the issue.
> In the same spirit as a recent change 
> ([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102)]
>  there could be a mechanism in place to detect when Flink run into this 
> specific issue and just skip the call to `getInputLocationFutures`  
> [https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108.]
> I'm not familiar enough with the internals of Flink to propose a more 
> advanced fix, however it seems like a configurable threshold on the number of 
> consumer vertices above which the preferred location is not computed would 
> do. If this  solution is good enough, I'd be happy to submit a PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-31144) Slow scheduling on large-scale batch jobs

Reply via email to