Hi,
Somewhat similar to Daniel Mescheder's mail yesterday on Spark SQL, I have a
data point on the performance of GROUP BY which indicates there is excessive
GC and that it is hurting throughput. I would like to know whether the new
memory manager for aggregations (https://github.com/apache/spark/pull/5725/)
is going to address this kind of issue.

I only have a small amount of data on each node (~360 MB) with a large heap
(18 GB), yet I still see 2-3 minor collections whenever I run a SELECT SUM()
with a GROUP BY. I have tried different Young Generation sizes without much
effect, though I have not yet tried different GC algorithms. (I should
perhaps also try reducing the RDD storage fraction.)
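
For concreteness, the knobs I have been varying are roughly the following.
This is only a sketch using the standard Spark 1.x settings; the exact
values here are illustrative, not the ones from my runs:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "18g")
      // Shrink the RDD storage pool (default is 0.6) to leave more
      // room for execution.
      .set("spark.storage.memoryFraction", "0.3")
      // Pin the Young Generation size and log GC activity so the
      // minor collections show up in the executor logs.
      .set("spark.executor.extraJavaOptions",
        "-Xmn4g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")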

I have made a chart of my results [1] by adding timing code to
Aggregates.scala. The query is Query 2 from Berkeley's AmpLab benchmark
(sketched below), running over 10 million records. The chart is from one of
the 4 worker nodes in the cluster.
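
For reference, Query 2 has the following shape. This is the 2A variant of
the benchmark query (the benchmark varies the substring length; 8 is the 2A
setting), run here via a sqlContext as in the Spark shell:

    // AmpLab Big Data Benchmark, Query 2A: sum ad revenue grouped by
    // a prefix of the source IP.
    sqlContext.sql("""
      SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue)
      FROM uservisits
      GROUP BY SUBSTR(sourceIP, 1, 8)
    """)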

I am trying to square this with a claim in the Project Tungsten blog post
[2]: "When profiling Spark user applications, we've found that a large
fraction of the CPU time is spent waiting for data to be fetched from main
memory."

Am I correct in assuming that Spark SQL has yet to reach that level of
efficiency, at least for aggregation operations?

Thanks.

[1] https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
[2] https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

Pramod
