1996fanrui commented on PR #25218: URL: https://github.com/apache/flink/pull/25218#issuecomment-2421167238
Hey @XComp @ztison, sorry, I'd like to discuss this PR with you again. Could we fix the issue for `DefaultSlotAssigner` and Application Mode first? I prefer to fix it first for several reasons:

- @XComp's first [concern](https://github.com/apache/flink/pull/25218#discussion_r1771187666) is that this fix conflicts with `execution.state-recovery.from-local`, so it's better handled in a FLIP.
  - That's why this PR only changes the code of `DefaultSlotAssigner` and doesn't change any code of `StateLocalitySlotAssigner`.
- @ztison's [concern](https://github.com/apache/flink/pull/25218#issuecomment-2401913141) is that this fix conflicts with spreading the workload across as many workers as possible.
  - As we discussed before, this concern only exists for session mode. That's why I'm wondering whether we could fix it for Application Mode first.
- The third reason is the most important one: the issue this PR tries to fix is more like a bug than an optimization for Application Mode with `execution.state-recovery.from-local` disabled.
  - The symptom of this bug is that TM resources cannot be released after scaling down (a small illustration is at the end of this comment).
  - I believe Flink users adopt the Adaptive Scheduler mainly to scale up and scale down quickly and efficiently.
  - Many users ask questions like: why aren't resources saved after scaling down?
  - This bug has been reported in 3 JIRAs: FLINK-33977, FLINK-35594 and FLINK-35903.
  - The main reason I want to discuss this with you again is that one Flink user [reported this bug](https://apache-flink.slack.com/archives/C03G7LJTS2G/p1729167222445569) again in the Slack troubleshooting channel, and the reporter cc'd me in the [next thread](https://apache-flink.slack.com/archives/C03G7LJTS2G/p1729167719506889) because I'm an active contributor to the autoscaler. (I guess he doesn't know that this bug or symptom is related to the Adaptive Scheduler, not to the autoscaler.)
  - It's worth mentioning that, as far as I know, @RocMarshal (the developer of this PR) didn't file a JIRA himself because he noticed the issue had already been reported in several JIRAs.
  - This means at least 5 users (from what I have observed, from 5 different companies) have hit this issue in their production jobs. I'm happy to see more and more companies trying out the Adaptive Scheduler.
- The fourth reason: 1.20 is the LTS version of the 1.x series.
  - If we consider it a bug, we can fix it in 1.20.x and 2.0.x together.
  - If we consider it an improvement or feature rather than a bug and address it in a FLIP, the issue cannot be fixed in the 1.x series.
    - That's why [FLIP-461](https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler) and [FLIP-472](https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states) cannot be backported to the 1.x series.
    - Actually, I think both [FLIP-461](https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler) and [FLIP-472](https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states) are great improvements for the Adaptive Scheduler. Thank you for the great work. ❤️
  - I believe most users (companies) are not able to maintain an internal Flink version, and they use the official Flink releases.
    If this bug is not fixed in 1.x, it may be difficult for the Adaptive Scheduler to be adopted by a large number of users on 1.x.
  - Of course, my team maintains an internal Flink version, so we can easily fix it in our own production environment. My initiative is mainly about enabling most Flink users to have a better Adaptive Scheduler experience.

Sorry to bother you again. This is definitely my last try; if you think it is unreasonable, I can accept that and handle it in a subsequent FLIP. Thank you very much. ❤️
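P.S. To make the scale-down symptom a bit more concrete, here is a tiny standalone sketch. It is purely illustrative: it is not the code from this PR and not Flink's `SlotAssigner` API, and names like `pickSlots` and `freeSlotsByTm` are made up. The idea is simply that if the scheduler fills slots on as few TaskManagers as possible, the remaining TMs end up fully idle and can be released after a scale-down:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: pick the requested number of slots from as few
// TaskManagers as possible, so TMs that end up with no assigned slots can be released.
public class FewestTaskManagersExample {

    static List<String> pickSlots(Map<String, List<String>> freeSlotsByTm, int required) {
        List<String> picked = new ArrayList<>();
        // Visit TMs with the most free slots first, so whole TMs are drained before
        // spilling over to the next one.
        freeSlotsByTm.entrySet().stream()
                .sorted(Comparator.comparingInt(
                        (Map.Entry<String, List<String>> e) -> e.getValue().size()).reversed())
                .forEach(entry -> {
                    for (String slot : entry.getValue()) {
                        if (picked.size() < required) {
                            picked.add(slot);
                        }
                    }
                });
        return picked;
    }

    public static void main(String[] args) {
        // Two TMs with 4 free slots each; after scaling down, the job only needs 4 slots.
        Map<String, List<String>> freeSlotsByTm = new LinkedHashMap<>();
        freeSlotsByTm.put("tm-1", List.of("tm-1-slot-0", "tm-1-slot-1", "tm-1-slot-2", "tm-1-slot-3"));
        freeSlotsByTm.put("tm-2", List.of("tm-2-slot-0", "tm-2-slot-1", "tm-2-slot-2", "tm-2-slot-3"));

        // All 4 slots come from tm-1, leaving tm-2 completely idle and releasable.
        // Spreading the 4 slots across both TMs instead would keep both TMs alive.
        System.out.println(pickSlots(freeSlotsByTm, 4));
    }
}
```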