xintongsong commented on pull request #18303: URL: https://github.com/apache/flink/pull/18303#issuecomment-1010713720
Thanks for working on this, @zjureel & @KarmaGYZ. I'm still not entirely sure about exposing two executors (or two sets of scheduling interfaces) from the RPC endpoint. It's not easy to maintain as it requires callers to understand the differences between the two executors (tasks will be canceled as soon as the RPC endpoint shutdown for one executor, while not canceled immediately for the other). I understand this is in response to the previous CI problems, where we suspect some clean-up operations are not completed due to tasks being canceled, and it's not easy to identify which clean-up operations are causing the problems. However, I'm not comfortable with scarifying the new design just to cover up for the real problems. I'd suggest to try a few more things before we go for the two-executors approach. - I noticed there are some calls to `schedule` with a zero delay, (e.g., `ApplicationDispatcherBootstrap#runApplicationAsync`). I think the purpose for these calls are to process asynchronously, rather than to schedule a future task. It might be helpful to check the delay value in `MainThreadExecutor#schedule`, and only use the scheduling thread pool when delay is positive. My gut feeling is most clean-up tasks are async tasks that needs to be executed immediately rather than in future. - Another thing we may want to try is to replace `ExecutorService#shutdownNow` with `Executor#awaitTermination`, to leave a short time for the clean-up tasks to complete. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org