1996fanrui commented on PR #25218:
URL: https://github.com/apache/flink/pull/25218#issuecomment-2421167238

   Hey @XComp @ztison, sorry, I'd like to discuss this PR with you again.
   
   Could we fix the issue for `DefaultSlotAssigner` and `Application Mode` 
first? I prefer to fix it first for several reasons:
   
   - @XComp's first 
[concern](https://github.com/apache/flink/pull/25218#discussion_r1771187666) is 
that this fix conflicts with `execution.state-recovery.from-local`, so it would 
be better handled in a FLIP.
       - That's why this PR only changes the code of `DefaultSlotAssigner` and 
doesn't change any code of `StateLocalitySlotAssigner`.
   - @ztison's 
[concern](https://github.com/apache/flink/pull/25218#issuecomment-2401913141) 
is that this fix conflicts with spreading the workload across as many workers 
as possible.
       - As we discussed before, this concern only exists for session mode. 
That's why I'm wondering whether we could fix it for `Application Mode` first.
   - The third reason is the most important: for `Application Mode` with 
`execution.state-recovery.from-local` disabled, the issue that this PR is 
trying to fix is more of a bug than an optimization.
       - The symptom of this bug is that TM resources cannot be released 
after scaling down.
       - I believe that Flink users use the Adaptive Scheduler mainly to scale 
up and scale down quickly or more efficiently.
       - Many users ask questions like: why can't resources be saved after 
scaling down?
       - This bug has been reported in 3 JIRA tickets: FLINK-33977, 
FLINK-35594 and FLINK-35903.
       - The main reason I want to discuss this with you again is that one 
Flink user 
[reported this bug](https://apache-flink.slack.com/archives/C03G7LJTS2G/p1729167222445569) 
again in the Slack troubleshooting channel, and the reporter cc'd me in the 
[next thread](https://apache-flink.slack.com/archives/C03G7LJTS2G/p1729167719506889) 
because I'm an active contributor to the autoscaler. (I guess he doesn't know 
that the bug is not related to the autoscaler; it's related to the Adaptive 
Scheduler.)
       - It is worth mentioning that, as far as I know, @RocMarshal (the 
developer of this PR) didn't file a JIRA ticket, because he noticed the issue 
had already been reported in other tickets.
       - This means at least 5 users (from what I have observed, these 5 users 
come from 5 different companies) faced this issue in their production jobs. I’m 
happy to see more and more companies trying out the Adaptive Scheduler.
   - The fourth reason: 1.20 is the LTS version for the 1.x series.
       - If we think it's a bug, we could fix it in 1.20.x and 2.0.x together.
       - If we think it's an improvement or feature rather than a bug, and 
address it in a FLIP, this issue cannot be fixed in the 1.x series.
           - That's why 
[FLIP-461](https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler)
 and 
[FLIP-472](https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states)
 cannot be backported to the 1.x series.
           - Actually, I think both 
[FLIP-461](https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler)
 and 
[FLIP-472](https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states)
 are great improvements for the Adaptive Scheduler. Thank you for the great work. ❤️
       - I believe most users (companies) are not able to maintain an 
internal Flink version, so they use the official Flink releases. If this bug is 
not fixed in 1.x, it may be difficult for the Adaptive Scheduler to be adopted 
by a large number of users in 1.x.
       - Of course, my team maintains our own internal Flink version, so we can 
easily fix it in our production environment. My goal is mainly to give most 
Flink users a better Adaptive Scheduler experience.
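
To make the symptom from the third reason concrete, here is a toy sketch. It is not Flink's actual `DefaultSlotAssigner` code: the class `SlotPackingSketch` and its methods are hypothetical, and slots are modelled only as a free-slot count per TaskManager. It compares taking slots evenly across TaskManagers with filling one TaskManager before moving on to the next, for a job that has scaled down to 4 slots on 3 TaskManagers with 4 slots each.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SlotPackingSketch {

    /** Hypothetical "spread" strategy: take one slot per TaskManager in turn. */
    static Map<String, Integer> assignSpread(Map<String, Integer> freeSlotsPerTm, int needed) {
        Map<String, Integer> remaining = new LinkedHashMap<>(freeSlotsPerTm);
        Map<String, Integer> assigned = new LinkedHashMap<>();
        int left = needed;
        boolean progress = true;
        while (left > 0 && progress) {
            progress = false;
            for (Map.Entry<String, Integer> tm : remaining.entrySet()) {
                if (left == 0) {
                    break;
                }
                if (tm.getValue() > 0) {
                    tm.setValue(tm.getValue() - 1);
                    assigned.merge(tm.getKey(), 1, Integer::sum);
                    left--;
                    progress = true;
                }
            }
        }
        return assigned;
    }

    /** Hypothetical "packed" strategy: fill the TaskManager with the most free slots first. */
    static Map<String, Integer> assignPacked(Map<String, Integer> freeSlotsPerTm, int needed) {
        List<Map.Entry<String, Integer>> tms = new ArrayList<>(freeSlotsPerTm.entrySet());
        tms.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Map<String, Integer> assigned = new LinkedHashMap<>();
        int left = needed;
        for (Map.Entry<String, Integer> tm : tms) {
            if (left == 0) {
                break;
            }
            int take = Math.min(left, tm.getValue());
            if (take > 0) {
                assigned.put(tm.getKey(), take);
                left -= take;
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        Map<String, Integer> free = new LinkedHashMap<>();
        free.put("tm-1", 4);
        free.put("tm-2", 4);
        free.put("tm-3", 4);

        // After scaling down, the job only needs 4 slots.
        System.out.println("spread uses TMs: " + assignSpread(free, 4).keySet());
        System.out.println("packed uses TMs: " + assignPacked(free, 4).keySet());
    }
}
```

In this toy model the spread assignment still occupies all three TaskManagers, so none of them can be released, while the packed assignment leaves two of them idle. This is also why the concern about spreading workload matters in session mode, where several jobs share the same TaskManagers.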
   
   Sorry to bother you again. This is definitely my last try. If you think it 
is unreasonable, I can accept it and deal with it in a subsequent FLIP. Thank 
you very much. ❤️


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

