xintongsong commented on pull request #18303:
URL: https://github.com/apache/flink/pull/18303#issuecomment-1010713720


   Thanks for working on this, @zjureel & @KarmaGYZ.
   
   I'm still not entirely sure about exposing two executors (or two sets of 
scheduling interfaces) from the RPC endpoint. It's not easy to maintain as it 
requires callers to understand the differences between the two executors (tasks 
will be canceled as soon as the RPC endpoint shutdown for one executor, while 
not canceled immediately for the other).
   
   I understand this is in response to the previous CI problems, where we 
suspect some clean-up operations are not completed due to tasks being canceled, 
and it's not easy to identify which clean-up operations are causing the 
problems. However, I'm not comfortable with scarifying the new design just to 
cover up for the real problems.
   
   I'd suggest to try a few more things before we go for the two-executors 
approach.
   - I noticed there are some calls to `schedule` with a zero delay, (e.g., 
`ApplicationDispatcherBootstrap#runApplicationAsync`). I think the purpose for 
these calls are to process asynchronously, rather than to schedule a future 
task. It might be helpful to check the delay value in 
`MainThreadExecutor#schedule`, and only use the scheduling thread pool when 
delay is positive. My gut feeling is most clean-up tasks are async tasks that 
needs to be executed immediately rather than in future.
   - Another thing we may want to try is to replace 
`ExecutorService#shutdownNow` with `Executor#awaitTermination`, to leave a 
short time for the clean-up tasks to complete.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to