To be clear, the number of map tasks is determined by the number of
partitions inside the RDD, hence the suggestion by Nicholas.
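
Because coalesce(2) without a shuffle is a narrow dependency, Spark
pipelines the map() into the final two tasks. If you want the map() to run
as 100 tasks while still writing only two files, one option (a minimal
sketch against the standard RDD API, reusing the names from Alex's snippet)
is to pass shuffle = true to the second coalesce, which inserts a stage
boundary between the map and the save:

    val rdd3 = rdd2.map { ... }                  // runs as a 100-task stage
    val rdd4 = rdd3.coalesce(2, shuffle = true)  // shuffle isolates the map stage
    rdd4.saveAsTextFile(...)                     // still writes only two files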

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> So do you get 2171 as the output for that command? That command tells you
> how many partitions your RDD has, so it’s good to first confirm that rdd1
> has as many partitions as you think it has.
>
>
> On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert <alex.boisv...@gmail.com>
> wrote:
>
>> It's actually a set of 2171 S3 files, with an average size of about 18MB.
>>
>>
>> On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> What do you get for rdd1._jrdd.splits().size()? You might think you’re
>>> getting > 100 partitions, but it may not be happening.
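>>>
>>> In the Scala API the equivalent check is a one-liner:
>>>
>>>     println(rdd1.partitions.size) // partition count the scheduler sees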
>>>
>>>
>>> On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <alex.boisv...@gmail.com>
>>> wrote:
>>>
>>>> With the following pseudo-code,
>>>>
>>>> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
>>>> val rdd2 = rdd1.coalesce(100)
>>>> val rdd3 = rdd2.map { ... }
>>>> val rdd4 = rdd3.coalesce(2)
>>>> rdd4.saveAsTextFile(...) // want only two output files (returns Unit, not an RDD)
>>>>
>>>> I would expect the parallelism of the map() operation to be 100
>>>> concurrent tasks, and the parallelism of the save() operation to be 2.
>>>>
>>>> However, it appears the parallelism of the entire chain is 2 -- I only
>>>> see two tasks created for the save() operation and those tasks appear to
>>>> execute the map() operation as well.
>>>>
>>>> Assuming what I'm seeing is as specified (meaning, this is how things are
>>>> meant to behave), what's the recommended way to force a parallelism of 100
>>>> for the map() operation?
>>>>
>>>> thanks!
>>>>
>>>>
>>>>
>>>
>>
>
