Hi,
I tried to write small program which shows that using cache() can speed up
execution but results with and without cache were similar. Could help me
with this issue? I tried to compute rdd and use it later in two places and
I thought in second usage this rdd is recomputed but it doesn't:
val help = sc.parallelize(Array.range(1, 20000)).repartition(100)
.map(x => (scala.util.Random.nextInt(10), x))
val rdd = sc.parallelize(Array.range(1,20000))
.repartition(100)
.map(x => (scala.util.Random.nextInt(10), x))
.join(help)
.map { case (x, (n, i)) => (x, n)}
.reduceByKey(_ + _)
.cache()
val rdd2 = sc.parallelize(Array.range(1,1000)).map(x => (x, x))
.join(rdd).saveAsTextFile("output/1")
val rdd3 = sc.parallelize(Array.range(1,1000)).map(x =>
(scala.util.Random.nextInt(1000), x))
.join(rdd).saveAsTextFile("output/2")
Thanks,
Grzegorz