Hello guys,
I am trying to run the following dummy example on Spark, on a 250 MB dataset,
using 5 machines with >10 GB of RAM each, but the join seems to be
taking far too long (>2 hours).
I am using Spark 0.8.0, but I have also tried the same example
on more recent versions, with the same results.
Do you have any idea why this is happening?
Thanks a lot,
Kostas
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions for pair-RDD operations such as leftOuterJoin

object DummyJoin {
  // args: <master URL> <input path> <split delimiter> <output path>
  def main(args: Array[String]) {
    val sc = new SparkContext(
      args(0),
      "DummyJoin",
      System.getenv("SPARK_HOME"),
      Seq(System.getenv("SPARK_EXAMPLES_JAR")))

    val file = sc.textFile(args(1))
    val wordTuples = file
      .flatMap(line => line.split(args(2)))
      .map(word => (word, 1))

    // "big": every word occurrence except "a"
    val big = wordTuples.filter {
      case (k, v) => k != "a"
    }.cache()

    // "small": every word occurrence except "a", "to" and "and"
    val small = wordTuples.filter {
      case (k, v) => k != "a" && k != "to" && k != "and"
    }.cache()

    val res = big.leftOuterJoin(small)
    res.saveAsTextFile(args(3))
  }
}
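
For reference, the four arguments are the master URL, the input path, the
token to split each line on, and the output path, so a run looks roughly
like this (the master URL and paths below are just placeholders):

    DummyJoin spark://master:7077 hdfs:///data/words.txt " " hdfs:///data/join-output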