The default number of tasks when reading files is based on how the files
are split into partitions (roughly one task per input split). After a
shuffle, the default number of tasks is controlled by the property
spark.default.parallelism (see
http://spark.apache.org/docs/latest/configuration.html).
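
For example, here's a minimal sketch of setting that property when
constructing the context (the app name and value are just illustrative;
pick a value based on your cluster's total cores):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: raise the post-shuffle parallelism for the whole app.
    val conf = new SparkConf()
      .setAppName("parallelism-example")
      .set("spark.default.parallelism", "64") // e.g. 2-3x total executor cores

    val sc = new SparkContext(conf)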

You can use RDD.repartition to increase or decrease the number of
partitions, and therefore tasks (or RDD.coalesce, but you must set
shuffle = true if you want to increase the number of partitions). Most
other RDD methods that cause a shuffle, such as reduceByKey or join, also
take an optional numPartitions argument to set the number of tasks, as in
the sketch below.
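
A rough Scala sketch (assuming an existing SparkContext named sc; the input
path and partition counts are only illustrative):

    // One task per input split when reading; the path is illustrative.
    val lines = sc.textFile("hdfs:///data/input")

    // repartition always shuffles and can grow or shrink the partition count.
    val wide = lines.repartition(64)

    // coalesce without a shuffle can only reduce the number of partitions...
    val narrow = wide.coalesce(8)

    // ...to grow it again you must pass shuffle = true.
    val grown = narrow.coalesce(32, shuffle = true)

    // Shuffle operations take numPartitions directly, e.g. 32 reduce tasks here.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 32)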



On Mon, Jul 7, 2014 at 11:25 AM, Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> Hi all,
>
> is there any way to control the number of tasks per stage?
>
> Currently I see a situation where only 2 tasks are created per stage and
> each of them is very slow, while at the same time the cluster has a huge
> number of unused nodes....
>
>
> Thank you,
> Konstantin Kudryavtsev
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
