What do you get for rdd1._jrdd.splits().size()? You might think you’re
getting > 100 partitions, but it may not be happening.
​


On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <alex.boisv...@gmail.com>
wrote:

> With the following pseudo-code,
>
> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
> val rdd2 = rdd1.coalesce(100)
> val rdd3 = rdd2 map { ... }
> val rdd4 = rdd3.coalesce(2)
> val rdd5 = rdd4.saveAsTextFile(...) // want only two output files
>
> I would expect the parallelism of the map() operation to be 100 concurrent
> tasks, and the parallelism of the save() operation to be 2.
>
> However, it appears the parallelism of the entire chain is 2 -- I only see
> two tasks created for the save() operation and those tasks appear to
> execute the map() operation as well.
>
> Assuming what I'm seeing is as-specified (meaning, how things are meant to
> be), what's the recommended way to force a parallelism of 100 on the map()
> operation?
>
> thanks!
>
>
>

Reply via email to