Hi Robert,

I am not sure I understand, so please confirm whether I understand your 
suggestions correctly:
- use fewer slots than the available slot capacity, to avoid issues such as a 
TaskManager not contributing its slots because of problems registering the TM;
(this means I will lose some performance by not using all of the available 
capacity)
- if a job fails because it loses a TaskManager (and its slots), the job will 
not restart even if free slots are available.
(in this case the 'spare slots' will not help, right? Losing a TM means the 
job will fail, with no recovery)
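
For reference, the spare-slot approach can be expressed in flink-conf.yaml; 
this is only a sketch with illustrative numbers (say 5 TaskManagers x 4 slots 
= 20 slots in total, with the default job parallelism kept at 15):

```yaml
# flink-conf.yaml (numbers illustrative)
# Each TaskManager offers 4 slots; with 5 TaskManagers the cluster has 20 slots.
taskmanager.numberOfTaskSlots: 4
# Keep the default job parallelism below total capacity so spare slots
# remain free for restarting the job after a TaskManager failure.
parallelism.default: 15
```

The parallelism can also be set per job, e.g. with `flink run -p 15 job.jar` 
on the command line or `env.setParallelism(15)` in the program.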

Thanks!

Best,
Ovidiu


> On 21 Mar 2016, at 14:15, Robert Metzger <rmetz...@apache.org> wrote:
> 
> Hi Ovidiu,
> 
> right now the scheduler in Flink will not use more slots than requested.
> To avoid issues on recovery, we usually recommend that users have some spare 
> slots (run the job with p=15 on a cluster with slots=20). I agree that it would 
> make sense to add a flag which allows a job to grab more slots if they are 
> available. The problem, however, is that jobs currently cannot change their 
> parallelism. So if a job fails, it cannot downscale to restart on the 
> remaining slots.
> That's why the spare-slots approach is currently the only way to go.
> 
> Regards,
> Robert
> 
> On Fri, Mar 18, 2016 at 1:30 PM, Ovidiu-Cristian MARCU 
> <ovidiu-cristian.ma...@inria.fr <mailto:ovidiu-cristian.ma...@inria.fr>> 
> wrote:
> Hi,
> 
> For the situation where a program specifies a maximum parallelism (so it is 
> supposed to use all available task slots), it is possible that one of the 
> task managers is not registered, for various reasons.
> In this case the job will fail because there are not enough free slots to run it.
> 
> To me this means the scheduler is limited to statically assigning tasks to 
> the task slots the job is configured with.
> 
> Instead, I would like to be able to specify a minimum parallelism for a job, 
> but also the possibility to dynamically use more task slots if additional 
> ones become available.
> Another use case: if during the execution of a job we lose a node, and thus 
> some task slots, then as long as the minimum parallelism is still ensured, the 
> job should recover and continue its execution instead of simply failing.
> 
> Is it possible to make such changes?
> 
> Best,
> Ovidiu
> 
