To be clear, the number of map tasks is determined by the number of partitions inside the RDD; hence the suggestion by Nicholas.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> So do you get 2171 as the output for that command? That command tells you
> how many partitions your RDD has, so it’s good to first confirm that rdd1
> has as many partitions as you think it has.
>
> On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert <alex.boisv...@gmail.com> wrote:
>
>> It's actually a set of 2171 S3 files, with an average size of about 18MB.
>>
>> On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> What do you get for rdd1._jrdd.splits().size()? You might think you’re
>>> getting > 100 partitions, but it may not be happening.
>>>
>>> On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <alex.boisv...@gmail.com> wrote:
>>>
>>>> With the following pseudo-code,
>>>>
>>>> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
>>>> val rdd2 = rdd1.coalesce(100)
>>>> val rdd3 = rdd2 map { ... }
>>>> val rdd4 = rdd3.coalesce(2)
>>>> val rdd5 = rdd4.saveAsTextFile(...) // want only two output files
>>>>
>>>> I would expect the parallelism of the map() operation to be 100
>>>> concurrent tasks, and the parallelism of the save() operation to be 2.
>>>>
>>>> However, it appears the parallelism of the entire chain is 2 -- I only
>>>> see two tasks created for the save() operation and those tasks appear to
>>>> execute the map() operation as well.
>>>>
>>>> Assuming what I'm seeing is as-specified (meaning, how things are meant
>>>> to be), what's the recommended way to force a parallelism of 100 on the
>>>> map() operation?
>>>>
>>>> thanks!
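To spell out why this happens and one way to get the behavior Alex asked for: coalesce() without a shuffle is a narrow dependency, so Spark pipelines the whole chain into the final stage, whose task count equals the final RDD's partition count (2 here). Passing shuffle = true to the last coalesce inserts a stage boundary after the map, so the map stage runs as 100 tasks while the save still writes 2 files. A sketch in the same pseudo-code style as the thread (untested; the `...` placeholders are the thread's own):

    val rdd1 = sc.sequenceFile(...)             // 2171 input partitions
    val rdd2 = rdd1.coalesce(100)               // narrow dependency: 2171 -> 100
    val rdd3 = rdd2 map { ... }                 // runs as 100 tasks once a shuffle follows
    val rdd4 = rdd3.coalesce(2, shuffle = true) // shuffle = true forces a stage boundary here
    rdd4.saveAsTextFile(...)                    // save stage runs as 2 tasks, writes 2 files

rdd3.repartition(2) is equivalent to rdd3.coalesce(2, shuffle = true); the cost is one extra shuffle of the mapped data.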