Hi Qi,

As far as I know, there is no such mechanism at the moment. To achieve this, I think it may be necessary to add a REST-based heartbeat mechanism between the Dispatcher and the Client. For now, perhaps you can add a monitoring service to clean up these leftover Flink clusters.

Best,
Haibo

At 2019-07-16 14:42:37, "qi luo" <luoqi...@gmail.com> wrote:

Hi guys,

We run thousands of Flink batch jobs every day. The batch jobs are submitted in attached mode, so the client knows when a job finishes and can then take further actions. To respond to user abort actions, we submit the jobs with "--shutdownOnAttachedExit" so that the Flink cluster is shut down when the client exits.

However, in some cases the Flink client exits abnormally (such as on an OOM), so the shutdown signal is never sent to the Flink cluster, causing a "job leak". The lingering Flink job continues to run and never ends, consuming a large amount of resources and possibly even producing unexpected results.

Does Flink have any mechanism to handle such a scenario (e.g. Spark has cluster mode, where the driver runs on the client side, so the job exits when the client exits)?

Any idea will be very appreciated!

Thanks,
Qi
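The monitoring service Haibo suggests could be sketched roughly as below: a script that polls a JobManager's REST endpoint and reports jobs that have been RUNNING longer than some threshold, so they can be cancelled or investigated. This is a hypothetical sketch, not an official tool; it assumes Flink's REST API `/jobs/overview` endpoint and its `jid`/`state`/`start-time` fields, the threshold is arbitrary, and both should be checked against the Flink version actually deployed.

```python
import json
import time
from urllib.request import urlopen

# Consider a batch job "leaked" after it has been running for 6 hours
# (arbitrary threshold; tune it to your longest legitimate batch job).
MAX_RUNTIME_MS = 6 * 60 * 60 * 1000


def find_leaked_jobs(overview, now_ms, max_runtime_ms=MAX_RUNTIME_MS):
    """Return the ids of RUNNING jobs older than max_runtime_ms.

    `overview` is the parsed JSON of Flink's /jobs/overview response,
    which lists each job's id ("jid"), "state", and "start-time" in ms.
    """
    leaked = []
    for job in overview.get("jobs", []):
        if job.get("state") == "RUNNING" and now_ms - job["start-time"] > max_runtime_ms:
            leaked.append(job["jid"])
    return leaked


def poll(rest_url):
    """Fetch /jobs/overview from a JobManager and report leaked job ids."""
    with urlopen(f"{rest_url}/jobs/overview") as resp:
        overview = json.load(resp)
    return find_leaked_jobs(overview, int(time.time() * 1000))


if __name__ == "__main__":
    # Offline demonstration with a canned response instead of a live cluster.
    sample = {"jobs": [
        {"jid": "a1", "state": "RUNNING", "start-time": 0},
        {"jid": "b2", "state": "FINISHED", "start-time": 0},
        {"jid": "c3", "state": "RUNNING", "start-time": 9_999_999_999},
    ]}
    print(find_leaked_jobs(sample, now_ms=10_000_000_000))  # → ['a1']
```

A real deployment would run `poll()` on a schedule (cron or a loop) against each known cluster and, for every leaked id, call the corresponding cancel endpoint or tear down the whole per-job cluster.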