Hi Robert,

I am not sure I understand, so please confirm whether I got your suggestions right:

- Use fewer slots than the available slot capacity, to avoid issues such as a TaskManager not contributing its slots because of problems registering the TM. (This means I will lose some performance by not using all the available capacity.)
- If a job fails because it loses a TaskManager (and its slots), the job will not restart even if free slots are available. (In that case the "spare slots" will not help, right? Losing a TM means the job fails, with no recovery.)
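To make this concrete, here is a minimal sketch of the spare-slots setup, assuming a 20-slot cluster (e.g. 4 TaskManagers with taskmanager.numberOfTaskSlots: 5 in flink-conf.yaml); the class name and job body are just illustrative:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class SpareSlotsJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Request only 15 of the 20 slots, leaving 5 spare slots so the
            // job can still be scheduled if one TaskManager fails to register
            // or is lost at runtime.
            env.setParallelism(15);

            env.fromElements(1, 2, 3)
               .map(new MapFunction<Integer, Integer>() {
                   @Override
                   public Integer map(Integer value) {
                       return value * 2;
                   }
               })
               .print();
        }
    }

Submitting with "bin/flink run -p 15 ..." should have the same effect, since -p sets the job's default parallelism.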
Thanks!

Best,
Ovidiu

> On 21 Mar 2016, at 14:15, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi Ovidiu,
>
> right now the scheduler in Flink will not use more slots than requested.
> To avoid issues on recovery, we usually recommend users to keep some
> spare slots (run the job with p=15 on a cluster with slots=20). I agree
> that it would make sense to add a flag which allows a job to grab more
> slots if they are available. The problem with that, however, is that
> jobs currently cannot change their parallelism. So if a job fails, it
> cannot downscale to restart on the remaining slots.
> That's why the spare-slots approach is currently the only way to go.
>
> Regards,
> Robert
>
> On Fri, Mar 18, 2016 at 1:30 PM, Ovidiu-Cristian MARCU
> <ovidiu-cristian.ma...@inria.fr> wrote:
>
> Hi,
>
> For the situation where a program specifies a maximum parallelism (so it
> is supposed to use all available task slots), it is possible that one of
> the task managers is not registered, for various reasons. In this case
> the job will fail because there are not enough free slots to run it.
>
> For me this means the scheduler is limited to statically assigning tasks
> to the task slots the job is configured with.
>
> Instead I would like to be able to specify a minimum parallelism for a
> job, but also the possibility to dynamically use more task slots if
> additional ones are available.
> Another use case: if during the execution of a job we lose one node, and
> therefore some task slots, the job should recover and continue its
> execution as long as the minimum parallelism is still ensured, instead
> of just failing.
>
> Is it possible to make such changes?
>
> Best,
> Ovidiu
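P.S. To illustrate the recovery side discussed above: a minimal sketch, assuming the 1.0-era execution-retries API (the retry count and job body are illustrative). A restarted job keeps its fixed parallelism, so recovery after a TaskManager loss succeeds only if enough spare slots remain free:

    import org.apache.flink.api.java.ExecutionEnvironment;

    public class RetryOnSpareSlots {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Restart the job up to 3 times after a failure (e.g. a lost
            // TaskManager). The job restarts at the SAME parallelism; since
            // it cannot downscale, recovery needs enough free (spare) slots.
            env.setNumberOfExecutionRetries(3); // illustrative retry count
            env.setParallelism(15);

            env.fromElements("a", "b", "c").print();
        }
    }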