Re: partitions, coalesce() and parallelism

2014-06-25 Thread Alex Boisvert
Thanks Daniel and Nicholas for the helpful responses. I'll go with coalesce(shuffle = true) and see how things go.

On Wed, Jun 25, 2014 at 8:19 AM, Daniel Siegmann wrote: …
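[Editor's sketch] A minimal illustration of the coalesce(shuffle = true) approach mentioned above, assuming a spark-shell SparkContext named sc, a stand-in input built with parallelize, and a hypothetical output path:

// With shuffle = true, coalesce() introduces a shuffle boundary, so the
// map() runs with the upstream 100 partitions and only the write uses 2.
val input  = sc.parallelize(1 to 1000, 100)          // stand-in for the real input
val mapped = input.map(_ + 1000)                     // executes as 100 tasks
val small  = mapped.coalesce(2, shuffle = true)      // 2 partitions after the shuffle
small.saveAsTextFile("/tmp/coalesce-shuffle-demo")   // hypothetical path; 2 output files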

Re: partitions, coalesce() and parallelism

2014-06-25 Thread Daniel Siegmann
The behavior you're seeing is by design, and it is VERY IMPORTANT to understand why this happens, because it can cause unexpected behavior in various ways. I learned that the hard way. :-) Spark collapses multiple transforms into a single "stage" wherever possible (presumably for performance). …
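[Editor's sketch] One way to see the stage collapsing described above (assuming a spark-shell sc; the RDDs here are stand-ins, not from the thread) is to compare the lineages printed by toDebugString; a stage boundary appears only where a shuffle does:

// Narrow dependencies only: map() and coalesce(2) are pipelined into one
// stage, so the whole job is scheduled as just 2 tasks.
val narrow = sc.parallelize(1 to 1000, 100).map(_ + 1).coalesce(2)
println(narrow.toDebugString)   // no ShuffledRDD in the lineage -> single stage

// shuffle = true inserts a shuffle: the map() stage keeps 100 tasks and the
// 2-partition stage runs after the boundary.
val wide = sc.parallelize(1 to 1000, 100).map(_ + 1).coalesce(2, shuffle = true)
println(wide.toDebugString)     // a ShuffledRDD appears, marking the stage boundary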

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Nicholas Chammas
Ah, here's a better hypothesis. Everything you are doing minus the save() is a transformation, not an action. Since nothing is actually triggered until the save(), Spark may be seeing that the lineage of operations ends with 2 partitions anyway and simplifying accordingly. Two suggestions you can try: …
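[Editor's sketch] A small illustration of the laziness point (sc, the values, and the partition counts are assumptions): every step below is a transformation, so nothing runs until the final action, and that action is scheduled against the last partition count in the lineage:

val a = sc.parallelize(1 to 1000, 100)   // lazy: no tasks run yet
val b = a.map(_ + 1000)                  // lazy
val c = b.coalesce(2)                    // lazy
println(a.partitions.size)               // 100
println(b.partitions.size)               // 100
println(c.partitions.size)               // 2
c.collect()   // first action: no shuffle in the lineage, so it runs as 2 tasks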

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
For the skeptics :), here's a version you can easily reproduce at home:

val rdd1 = sc.parallelize(1 to 1000, 100) // force with 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2 map { _ + 1000 }
val rdd4 = rdd3.coalesce(2)
rdd4.collect()

You can see that everything runs as only 2 tasks.
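[Editor's sketch] One hedged way to confirm why the repro above collapses to 2 tasks, reusing rdd3 and rdd4 from that snippet (the lineage walk itself is an illustration, not part of the original message): stage boundaries exist only at non-narrow (shuffle) dependencies, and rdd4's lineage has none.

import org.apache.spark.NarrowDependency
import org.apache.spark.rdd.RDD

// Walk the lineage looking for a dependency that is not narrow, i.e. a
// shuffle boundary. rdd4's lineage is entirely narrow, so map() and
// coalesce(2) are pipelined into a single 2-task stage.
def hasShuffleBoundary(rdd: RDD[_]): Boolean =
  rdd.dependencies.exists(dep =>
    !dep.isInstanceOf[NarrowDependency[_]] || hasShuffleBoundary(dep.rdd))

println(hasShuffleBoundary(rdd4))                              // false
println(hasShuffleBoundary(rdd3.coalesce(2, shuffle = true)))  // true: shuffle added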

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
Yes.

scala> rawLogs.partitions.size
res1: Int = 2171

On Tue, Jun 24, 2014 at 4:00 PM, Mayur Rustagi wrote: …

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Mayur Rustagi
To be clear, the number of map tasks is determined by the number of partitions inside the RDD, hence the suggestion by Nicholas.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas wrote: …
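[Editor's sketch] A quick illustration of that point (sc and the numbers are assumptions): each partition of an RDD is computed by exactly one task, so the task count of a map stage is just the partition count of the RDD it runs over.

val rdd = sc.parallelize(1 to 10, 4)   // 4 partitions -> the map stage runs as 4 tasks
val perPartition = rdd.mapPartitionsWithIndex { (i, iter) =>
  Iterator((i, iter.size))             // one record per partition/task
}
perPartition.collect().foreach { case (p, n) =>
  println(s"partition (task) $p processed $n records")
}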

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Nicholas Chammas
So do you get 2171 as the output for that command? That command tells you how many partitions your RDD has, so it's good to first confirm that rdd1 has as many partitions as you think it has.

On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert wrote: …

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
It's actually a set of 2171 S3 files, with an average size of about 18MB.

On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: …

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Nicholas Chammas
What do you get for rdd1._jrdd.splits().size()? You might think you're getting > 100 partitions, but it may not be happening.

On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert wrote: …
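[Editor's note] _jrdd is the PySpark RDD's handle on the underlying Java RDD; if the goal is simply to read the partition count from the Scala shell used elsewhere in this thread (an assumption about the intent here), the equivalent check is:

rdd1.partitions.size   // partition count of rdd1; each partition becomes one task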

partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
With the following pseudo-code,

val rdd1 = sc.sequenceFile(...) // has > 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2 map { ... }
val rdd4 = rdd3.coalesce(2)
val rdd5 = rdd4.saveAsTextFile(...) // want only two output files

I would expect the parallelism of the map() operation to be 100 …