Hi all,
Following up on a previous thread, I am asking how to implement a divide-and-conquer
algorithm (skyline) in Spark.
Here is my current solution:
val data = sc.textFile(…).map(line => line.split(",").map(_.toDouble))
val result = data.mapPartitions(points => skyline(points.toArray).iterator)
  .coalesce(1, true)
  .mapPartitions(points => skyline(points.toArray).iterator)
  .collect()
where skyline is a local algorithm to compute the results:
def skyline(points: Array[Point]): Array[Point]
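For concreteness, here is a minimal sketch of what I mean by the local algorithm, assuming Point is just an alias for Array[Double] and that smaller is better in every dimension (the real implementation is more involved; this is only to fix the semantics):

```scala
// Hypothetical sketch of the local skyline step. Assumes Point is an
// alias for Array[Double] and that smaller is better in every dimension.
object SkylineSketch {
  type Point = Array[Double]

  // p dominates q if p is <= q in every dimension and < in at least one.
  def dominates(p: Point, q: Point): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } &&
      p.zip(q).exists { case (a, b) => a < b }

  // O(n^2) skyline: keep exactly the points no other point dominates.
  def skyline(points: Array[Point]): Array[Point] =
    points.filter(p => !points.exists(q => dominates(q, p)))
}
```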
Basically, I find this implementation runs slower than the corresponding Hadoop
version (an identity map phase, plus the local skyline algorithm for both the
combine and reduce phases).
Below are my questions:
1. Why is this implementation much slower than the Hadoop one?
I can see two possible reasons: one is the shuffle overhead in coalesce; the
other is repeatedly calling toArray and iterator when invoking the local
skyline algorithm. But I am not sure which is the real cause.
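To isolate the second suspect, one variant I have considered (an untested sketch, again assuming Point = Array[Double] with smaller-is-better dominance) is to build the partial skyline incrementally while consuming the partition's iterator, so the partition is never materialized with toArray:

```scala
// Untested sketch: maintain a running skyline buffer while consuming the
// iterator, instead of materializing the partition with toArray.
// Assumes Point = Array[Double] and smaller-is-better dominance.
object StreamingSkyline {
  type Point = Array[Double]

  def dominates(p: Point, q: Point): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } &&
      p.zip(q).exists { case (a, b) => a < b }

  def skyline(points: Iterator[Point]): Iterator[Point] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[Point]
    for (p <- points) {
      // Keep p only if nothing already in the buffer dominates it.
      if (!buf.exists(q => dominates(q, p))) {
        // p survives: evict everything p dominates, then add p.
        val kept = buf.filterNot(q => dominates(p, q))
        buf.clear()
        buf ++= kept
        buf += p
      }
    }
    buf.iterator
  }
}
```

This would plug in as data.mapPartitions(StreamingSkyline.skyline), which at least removes the toArray/iterator round trip from the picture.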
2. One observation is that while the Hadoop version used up almost all the CPU
resources during execution, CPU utilization seems much lower under Spark. Is
that a clue that shuffling might be the real bottleneck?
3. Is there any difference between coalesce(1, true) and repartition? It seems
that both operations need to shuffle data. In what situations is the coalesce
method the proper choice?
4. More generally, I am trying to implement some core geometry computation
operators on Spark (like skyline, convex hull, etc.). In my understanding, since
Spark is more capable of handling iterative computations on datasets, the above
solution apparently doesn't exploit what Spark is good at. Any comments on how
to do geometry computations on Spark (if it is possible)?
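For what it's worth, one alternative I have been considering (an untested sketch, using a simple O(n^2) local skyline and the same assumptions as before: Point = Array[Double], smaller-is-better) is to shrink each partition to its local skyline and then merge the partial skylines pairwise with reduce, rather than shuffling everything into one partition with coalesce(1, true). In Spark this might look like data.mapPartitions(it => Iterator(skyline(it.toArray))).reduce((a, b) => skyline(a ++ b)). The merge step itself can be exercised without a cluster:

```scala
// Untested sketch of the merge step. Merging per-partition skylines
// pairwise gives the same result as one skyline over the union, since
// dominance is transitive. In Spark this would be the reduce function in:
//   data.mapPartitions(it => Iterator(skyline(it.toArray)))
//       .reduce((a, b) => skyline(a ++ b))
object SkylineMerge {
  type Point = Array[Double]

  def dominates(p: Point, q: Point): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } &&
      p.zip(q).exists { case (a, b) => a < b }

  def skyline(points: Array[Point]): Array[Point] =
    points.filter(p => !points.exists(q => dominates(q, p)))

  // Simulate the reduce over partial skylines from several "partitions".
  def mergeAll(parts: Seq[Array[Point]]): Array[Point] =
    parts.map(skyline).reduce((a, b) => skyline(a ++ b))
}
```

The idea is that the reduce then moves only the (usually small) partial skylines rather than the full dataset, which I suspect would cut the shuffle cost that coalesce(1, true) incurs.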
Thanks for any insight.
Yanzhe