[ https://issues.apache.org/jira/browse/FLINK-31144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Tournay updated FLINK-31144: ----------------------------------- Description: When executing a complex job graph at high parallelism `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can get slow and cause long pauses where the JobManager becomes unresponsive and all the taskmanagers just wait. I've attached a VisualVM snapshot to illustrate the problem.[^flink-1.17-snapshot-1676473798013.nps] At Spotify we have complex jobs where this issue can cause batch "pause" of 40+ minutes and make the overall execution 30% slower or more. More importantly this prevent us from running said jobs on larger cluster as adding resources to the cluster worsen the issue. We have successfully tested a modified Flink version where `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` was completely commented and simply returns an empty collection and confirmed it solves the issue. In the same spirit as a recent change ([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102)] there could be a mechanism in place to detect when Flink run into this specific issue and just skip the call to `getInputLocationFutures` [https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108.] I'm not familiar enough with the internals of Flink to propose a more advanced fix, however it seems like a configurable threshold on the number of consumer vertices above which the preferred location is not computed would do. If this solution is good enough, I'd be happy to submit a PR. was: When executing a complex job graph at high parallelism `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can get slow and cause long pauses where the JobManager becomes unresponsive and all the taskmanagers just wait. I've attached a VisualVM snapshot to illustrate the problem.[^flink-1.17-snapshot-1676473798013.nps] At Spotify we have complex jobs where this issue can cause batch "pause" of 40+ minutes and make the overall execution 30% slower or more. More importantly this prevent us from running said jobs on larger cluster as adding resources to the cluster worsen the issue. We have successfully tested a test where `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` was completely commented and simply returns an empty collection and confirmed it solves the issue. In the same spirit as a recent change ([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102)] there could be a mechanism in place to detect when Flink run into this specific issue and just skip the call to `getInputLocationFutures` [https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108.] I'm not familiar enough with the internals of Flink to propose a more advanced fix, however it seems like a configurable threshold on the number of consumer vertices above which the preferred location is not computed would do. If this solution is good enough, I'd be happy to submit a PR. > Slow scheduling on large-scale batch jobs > ------------------------------------------ > > Key: FLINK-31144 > URL: https://issues.apache.org/jira/browse/FLINK-31144 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Reporter: Julien Tournay > Priority: Major > Attachments: flink-1.17-snapshot-1676473798013.nps > > > When executing a complex job graph at high parallelism > `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can > get slow and cause long pauses where the JobManager becomes unresponsive and > all the taskmanagers just wait. I've attached a VisualVM snapshot to > illustrate the problem.[^flink-1.17-snapshot-1676473798013.nps] > At Spotify we have complex jobs where this issue can cause batch "pause" of > 40+ minutes and make the overall execution 30% slower or more. > More importantly this prevent us from running said jobs on larger cluster as > adding resources to the cluster worsen the issue. > We have successfully tested a modified Flink version where > `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` was > completely commented and simply returns an empty collection and confirmed it > solves the issue. > In the same spirit as a recent change > ([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102)] > there could be a mechanism in place to detect when Flink run into this > specific issue and just skip the call to `getInputLocationFutures` > [https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108.] > I'm not familiar enough with the internals of Flink to propose a more > advanced fix, however it seems like a configurable threshold on the number of > consumer vertices above which the preferred location is not computed would > do. If this solution is good enough, I'd be happy to submit a PR. -- This message was sent by Atlassian Jira (v8.20.10#820010)