Re: partitions, coalesce() and parallelism

2014-06-25 Thread Alex Boisvert
>> val rdd4 = rdd3.coalesce(2)
>> val rdd5 = rdd4.saveAsTextFile(...) // want only two output files
>>
>> rdd3.unpersist()
>>
>> This should let the map() run 100 tasks in parallel while giving you
>> only 2 output files. You'll get
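The beginning of the quoted reply is cut off above, but the visible rdd3.unpersist() points at the usual persist-then-coalesce pattern. A minimal sketch of that pattern, assuming the persist() call, the forcing count(), the key/value types, and the paths (none of which appear in the snippet):

val rdd1 = sc.sequenceFile[String, String]("hdfs:///path/to/input")  // path and types assumed; > 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2.map { case (k, v) => s"$k\t$v" }            // placeholder transform
rdd3.persist()                                              // assumed: cache the mapped data
rdd3.count()                                                // assumed: forces the map() to run as ~100 parallel tasks
rdd3.coalesce(2).saveAsTextFile("hdfs:///path/to/output")   // only two output files
rdd3.unpersist()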

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1229, 1230, 1241, 1242, 1243, 1244, 1245, 1246, 1247, 1248, 1249...

On Tue, Jun 24, 2014 at 5:39 PM, Alex Boisvert wrote:
> Yes.
>
> scala> rawLogs.partitions.size
> res1: Int =

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
…tells
>> you how many partitions your RDD has, so it’s good to first confirm that
>> rdd1 has as many partitions as you think it has.
>>
>> On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert wrote:
>>
>>> It's actually
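The truncated sentence above is evidently about rdd.partitions.size. A quick way to confirm the partition count at each stage in the spark-shell (variable names and path are assumed):

val rdd1 = sc.sequenceFile[String, String]("hdfs:///path/to/input")  // path assumed
println(rdd1.partitions.size)                  // how many partitions the input really has
println(rdd1.coalesce(100).partitions.size)    // coalesce without a shuffle can only reduce this, never increase it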

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
…may not be happening.
>
> On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert wrote:
>
>> With the following pseudo-code,
>>
>> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
>> val rdd2 = rdd1.coalesce(100)
>> val rdd3 = rdd2 map {

partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
With the following pseudo-code,

val rdd1 = sc.sequenceFile(...) // has > 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2 map { ... }
val rdd4 = rdd3.coalesce(2)
val rdd5 = rdd4.saveAsTextFile(...) // want only two output files

I would expect the parallelism of the map() operation to
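For background on why this expectation can fail: coalesce() without a shuffle folds the parent partitions into the downstream tasks, so the coalesce(2) here makes the whole map-and-save chain run as only 2 tasks. One way to keep ~100 map tasks while still writing two files, not necessarily what was suggested later in this thread, is coalesce with shuffle = true, at the cost of a shuffle (the transform and paths below are placeholders):

val rdd2 = sc.textFile("hdfs:///path/to/input").coalesce(100)
val rdd3 = rdd2.map(line => line.toUpperCase)   // runs as ~100 parallel tasks
rdd3.coalesce(2, shuffle = true)                // shuffle = true decouples the map() from the 2-partition write
  .saveAsTextFile("hdfs:///path/to/output")     // produces two part files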

Re: what is the best way to do cartesian

2014-04-25 Thread Alex Boisvert
You might want to try the built-in RDD.cartesian() method.

On Thu, Apr 24, 2014 at 9:05 PM, Qin Wei wrote:
> Hi All,
>
> I have a problem with the Item-Based Collaborative Filtering Recommendation
> Algorithms in spark.
> The basic flow is as below:
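For reference, RDD.cartesian() pairs every element of one RDD with every element of another (so the number of pairs grows quadratically). A tiny sketch of using it for pairwise item scoring; the item vectors and the dot-product "similarity" are made up for illustration, not taken from Qin Wei's code:

val items = sc.parallelize(Seq(
  ("a", Array(1.0, 0.0)),
  ("b", Array(0.5, 0.5)),
  ("c", Array(0.0, 1.0))
))

val pairScores = items.cartesian(items)
  .filter { case ((id1, _), (id2, _)) => id1 < id2 }           // keep each unordered pair once
  .map { case ((id1, v1), (id2, v2)) =>
    ((id1, id2), v1.zip(v2).map { case (x, y) => x * y }.sum)  // stand-in similarity score
  }

pairScores.collect().foreach(println)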

Re: Spark - ready for prime time?

2014-04-10 Thread Alex Boisvert
I'll provide answers from our own experience at Bizo. We've been using Spark for 1+ year now and have found it generally better than previous approaches (Hadoop + Hive mostly).

On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth <andras.nem...@lynxanalytics.com> wrote:
> I. Is it too much magic? Lo