[ https://issues.apache.org/jira/browse/FLINK-31144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691474#comment-17691474 ]
Julien Tournay commented on FLINK-31144:
----------------------------------------

Hi [~huwh],

Thank you for the quick reply :)

{quote}when the topology is complex.
{quote}
Indeed. For the issue to be noticeable, the job graph has to be fairly complex, feature all-to-all distributions, and execute with a high parallelism.

{quote}1. Is the slow scheduling or the scheduled result of location preferred make your job slow?
{quote}
Yes, it very much does. We have a job that takes ~2h30 (after many, many tweaks to get the best possible performance). It's impossible to get it to run in less time, because adding more taskmanagers makes the scheduling slower and the overall execution gets longer. Removing preferred locations makes it possible to run it in less than 2h (we're aiming at ~1h45min).

{quote}2. "we have complex jobs where this issue can cause batch "pause" of 40+ minutes" What does "pause" meaning? Is the getPreferredLocationsBasedOnInputs take more than 40+ minutes?
{quote}
By "pause" I mean that at the beginning of the execution, the taskmanagers wait for the JobManager for ~40min and only then start processing. With Flink 1.17 and no preferred locations, the "pause" is down to ~5min. I should also mention that during the pause the JM is very unresponsive and the web console struggles to show anything.

{quote}Could you provide the topology of the complex job.
{quote}
I can, but I'm not sure what format to use. The graph is quite big and a simple screenshot is unreadable:

!image-2023-02-21-10-29-49-388.png!

I can maybe share the archived execution json file (~500Mb) if that's helpful?

> Slow scheduling on large-scale batch jobs
> ------------------------------------------
>
> Key: FLINK-31144
> URL: https://issues.apache.org/jira/browse/FLINK-31144
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Julien Tournay
> Priority: Major
> Attachments: flink-1.17-snapshot-1676473798013.nps,
> image-2023-02-21-10-29-49-388.png
>
>
> When executing a complex job graph at high parallelism,
> `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` can
> get slow and cause long pauses where the JobManager becomes unresponsive and
> all the taskmanagers just wait. I've attached a VisualVM snapshot to
> illustrate the problem.[^flink-1.17-snapshot-1676473798013.nps]
> At Spotify we have complex jobs where this issue can cause a batch "pause" of
> 40+ minutes and make the overall execution 30% slower or more.
> More importantly, this prevents us from running said jobs on larger clusters,
> as adding resources to the cluster worsens the issue.
> We have successfully tested a modified Flink version where the body of
> `DefaultPreferredLocationsRetriever.getPreferredLocationsBasedOnInputs` was
> commented out entirely so that it simply returns an empty collection, and
> confirmed that this solves the issue.
> In the same spirit as a recent change
> ([https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L98-L102]),
> there could be a mechanism in place to detect when Flink runs into this
> specific issue and just skip the call to `getInputLocationFutures`
> [https://github.com/apache/flink/blob/43f419d0eccba86ecc8040fa6f521148f1e358ff/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultPreferredLocationsRetriever.java#L105-L108].
> I'm not familiar enough with the internals of Flink to propose a more
> advanced fix, however it seems like a configurable threshold on the number of
> consumer vertices above which the preferred location is not computed would
> do. If this solution is good enough, I'd be happy to submit a PR.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)