>>
>>val rdd4 = rdd3.coalesce(2)
>>val rdd5 = rdd4.saveAsTextFile(...) // want only two output files
>>
>>rdd3.unpersist()
>>
>>This should let the map() run 100 tasks in parallel while giving you
>>only 2 output files.
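>>
>>Spelled out end to end, the pattern above would look roughly like this (a
>>sketch; the paths and the map body are placeholders, and the count() is only
>>there to force rdd3 to be computed and cached before the coalesce):
>>
>>val rdd1 = sc.sequenceFile[String, String]("hdfs:///input/path")
>>val rdd2 = rdd1.coalesce(100)
>>val rdd3 = rdd2.map { case (k, v) => s"$k\t$v" }.persist()
>>rdd3.count()                 // runs the map as 100 tasks and fills the cache
>>val rdd4 = rdd3.coalesce(2)  // narrow coalesce over the cached partitions
>>rdd4.saveAsTextFile("hdfs:///output/path")  // returns Unit; writes 2 part files
>>rdd3.unpersist()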
On Tue, Jun 24, 2014 at 5:39 PM, Alex Boisvert wrote:
> Yes.
>
> scala> rawLogs.partitions.size
> res1: Int =
>> Calling rdd.partitions.size tells
>> you how many partitions your RDD has, so it’s good to first confirm that
>> rdd1 has as many partitions as you think it has.
>>
>>
>>
>> On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert wrote:
>>
>>> It's actually [...] may not be happening.
>
>
>
> On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert wrote:
>
>> With the following pseudo-code,
>>
>> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
>> val rdd2 = rdd1.coalesce(100)
>> val rdd3 = rdd2 map { ... }
With the following pseudo-code,
val rdd1 = sc.sequenceFile(...) // has > 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2 map { ... }
val rdd4 = rdd3.coalesce(2)
val rdd5 = rdd4.saveAsTextFile(...) // want only two output files
I would expect the parallelism of the map() operation to be 100 (one task per
partition) while still ending up with only two output files.
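A quick way to confirm what each step will actually do is to check the
partition counts in the spark-shell (a sketch, using the names above):

rdd1.partitions.size   // should report the (> 100) number of input partitions
rdd2.partitions.size   // 100 after coalesce(100)
rdd4.partitions.size   // 2, which is also the number of save tasks and output files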
You might want to try the built-in RDD.cartesian() method.
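For example, something along these lines (a sketch; the item IDs and rating
vectors below are made up):

val items = sc.parallelize(Seq(
  ("item1", Array(1.0, 0.0, 2.0)),
  ("item2", Array(0.0, 1.0, 1.0)),
  ("item3", Array(2.0, 1.0, 0.0))))
val pairs = items.cartesian(items)
  .filter { case ((id1, _), (id2, _)) => id1 < id2 }  // keep each unordered pair once
// each element of pairs is ((id1, vec1), (id2, vec2)), ready for a similarity function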
On Thu, Apr 24, 2014 at 9:05 PM, Qin Wei wrote:
> Hi All,
>
> I have a problem with the Item-Based Collaborative Filtering Recommendation
> Algorithms in spark.
> The basic flow is as below:
>
I'll provide answers from our own experience at Bizo. We've been using
Spark for 1+ year now and have found it generally better than previous
approaches (Hadoop + Hive mostly).
On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth <andras.nem...@lynxanalytics.com> wrote:
> I. Is it too much magic? [...]