What do you get for rdd1._jrdd.splits().size()? You might think you’re getting > 100 partitions, but it may not be happening.
On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <alex.boisv...@gmail.com> wrote: > With the following pseudo-code, > > val rdd1 = sc.sequenceFile(...) // has > 100 partitions > val rdd2 = rdd1.coalesce(100) > val rdd3 = rdd2 map { ... } > val rdd4 = rdd3.coalesce(2) > val rdd5 = rdd4.saveAsTextFile(...) // want only two output files > > I would expect the parallelism of the map() operation to be 100 concurrent > tasks, and the parallelism of the save() operation to be 2. > > However, it appears the parallelism of the entire chain is 2 -- I only see > two tasks created for the save() operation and those tasks appear to > execute the map() operation as well. > > Assuming what I'm seeing is as-specified (meaning, how things are meant to > be), what's the recommended way to force a parallelism of 100 on the map() > operation? > > thanks! > > >