Hi all,
Following up on a previous thread, I am asking how to implement a divide-and-conquer
algorithm (skyline) in Spark.
Here is my current solution:
val data = sc.textFile(…).map(line => line.split(",").map(_.toDouble))
val result = data.mapPartitions(points => skyline(points.toArray).iterator)
  .coalesce(1, true)
  .mapPartitions(points => skyline(points.toArray).iterator)
  .collect()
where skyline is a local algorithm to compute the results:
def skyline(points: Array[Point]): Array[Point]
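For concreteness, here is a minimal sketch of what I mean by the local algorithm, assuming Point is just an alias for Array[Double] and that smaller is better in every dimension (the real implementation is more involved; this is only to fix the semantics):

```scala
// Hypothetical sketch of the local skyline step. Assumes Point is an
// alias for Array[Double] and that smaller is better in every dimension.
object SkylineSketch {
  type Point = Array[Double]

  // p dominates q if p is <= q in every dimension and < in at least one.
  def dominates(p: Point, q: Point): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } &&
      p.zip(q).exists { case (a, b) => a < b }

  // O(n^2) skyline: keep exactly the points no other point dominates.
  def skyline(points: Array[Point]): Array[Point] =
    points.filter(p => !points.exists(q => dominates(q, p)))
}
```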
Basically, I find this implementation runs slower than the corresponding Hadoop
version (an identity map phase, plus the local skyline algorithm for both the
combine and reduce phases).
Below are my questions:
1. Why is this implementation much slower than the Hadoop one?
I can see two possible reasons: one is the shuffle overhead in coalesce; the
other is repeatedly calling toArray and iterator when invoking the local
skyline algorithm. But I am not sure which is the real cause.
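To isolate the second suspect, one variant I have considered (an untested sketch, again assuming Point = Array[Double] with smaller-is-better dominance) is to build the partial skyline incrementally while consuming the partition's iterator, so the partition is never materialized with toArray:

```scala
// Untested sketch: maintain a running skyline buffer while consuming the
// iterator, instead of materializing the partition with toArray.
// Assumes Point = Array[Double] and smaller-is-better dominance.
object StreamingSkyline {
  type Point = Array[Double]

  def dominates(p: Point, q: Point): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } &&
      p.zip(q).exists { case (a, b) => a < b }

  def skyline(points: Iterator[Point]): Iterator[Point] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[Point]
    for (p <- points) {
      // Keep p only if nothing already in the buffer dominates it.
      if (!buf.exists(q => dominates(q, p))) {
        // p survives: evict everything p dominates, then add p.
        val kept = buf.filterNot(q => dominates(p, q))
        buf.clear()
        buf ++= kept
        buf += p
      }
    }
    buf.iterator
  }
}
```

This would plug in as data.mapPartitions(StreamingSkyline.skyline), which at least removes the toArray/iterator round trip from the picture.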
2. One observation is that while the Hadoop version used up almost all the CPU
resources during execution, CPU utilization seems much lower under Spark. Is
that a clue that shuffling might be the real bottleneck?
3. Is there any difference between coalesce(1, true) and repartition? It seems
that both operations need to shuffle data. In what situations is the coalesce
method the proper choice?
4. More generally, I am trying to implement some core geometry computation
operators on Spark (like skyline, convex hull, etc.). In my understanding, since
Spark is more capable of handling iterative computations on datasets, the above
solution apparently doesn't exploit what Spark is good at. Any comments on how
to do geometry computations on Spark (if it is possible)?
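For what it's worth, one alternative I have been considering (an untested sketch, using a simple O(n^2) local skyline and the same assumptions as before: Point = Array[Double], smaller-is-better) is to shrink each partition to its local skyline and then merge the partial skylines pairwise with reduce, rather than shuffling everything into one partition with coalesce(1, true). In Spark this might look like data.mapPartitions(it => Iterator(skyline(it.toArray))).reduce((a, b) => skyline(a ++ b)). The merge step itself can be exercised without a cluster:

```scala
// Untested sketch of the merge step. Merging per-partition skylines
// pairwise gives the same result as one skyline over the union, since
// dominance is transitive. In Spark this would be the reduce function in:
//   data.mapPartitions(it => Iterator(skyline(it.toArray)))
//       .reduce((a, b) => skyline(a ++ b))
object SkylineMerge {
  type Point = Array[Double]

  def dominates(p: Point, q: Point): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } &&
      p.zip(q).exists { case (a, b) => a < b }

  def skyline(points: Array[Point]): Array[Point] =
    points.filter(p => !points.exists(q => dominates(q, p)))

  // Simulate the reduce over partial skylines from several "partitions".
  def mergeAll(parts: Seq[Array[Point]]): Array[Point] =
    parts.map(skyline).reduce((a, b) => skyline(a ++ b))
}
```

The idea is that the reduce then moves only the (usually small) partial skylines rather than the full dataset, which I suspect would cut the shuffle cost that coalesce(1, true) incurs.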
Thanks for any insight.
Yanzhe