I hadn't turned on codegen. I enabled it and ran the query again, and it now runs 4-5 times faster! :) Since my log statements no longer appear, I presume the code path is quite different from the earlier hashmap-related code in Aggregates.scala?
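
For anyone following along, here is a minimal sketch of how codegen can be switched on in the Spark 1.x releases of this period. It assumes the `spark.sql.codegen` setting, which is off by default in those releases, and it is not necessarily how it was set for the run described above:

```
# conf/spark-defaults.conf (assumed location; the same key can also be
# passed via --conf on spark-submit)
spark.sql.codegen    true
```

The same key can also be set per session from SQL with `SET spark.sql.codegen=true;`.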

Pramod

On Wed, May 20, 2015 at 9:18 PM, Reynold Xin <r...@databricks.com> wrote:

> Does this turn codegen on? I think the performance is fairly different
> when codegen is turned on.
>
> For 1.5, we are investigating having codegen on by default, so users get
> much better performance out of the box.
>
>
> On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri <pramodbilig...@gmail.com> wrote:
>
>> Hi,
>> Somewhat similar to Daniel Mescheder's mail yesterday on SparkSql, I have
>> a data point regarding the performance of Group By, indicating there's
>> excessive GC and it's impacting the throughput. I want to know if the new
>> memory manager for aggregations (
>> https://github.com/apache/spark/pull/5725/) is going to address this
>> kind of issue.
>>
>> I only have a small amount of data on each node (~360MB) with a large
>> heap size (18 Gig). I still see 2-3 minor collections happening whenever I
>> do a Select Sum() with a group by(). I have tried with different sizes for
>> Young Generation without much effect, though not with different GC
>> algorithms (Hm..I ought to try reducing the rdd storage fraction perhaps).
>>
>> I have made a chart of my results [1] by adding timing code to
>> Aggregates.scala. The query is actually Query 2 from Berkeley's AmpLab
>> benchmark, running over 10 million records. The chart is from one of the 4
>> worker nodes in the cluster.
>>
>> I am trying to square this with a claim on the Project Tungsten blog post
>> [2]: "When profiling Spark user applications, we’ve found that a large
>> fraction of the CPU time is spent waiting for data to be fetched from main
>> memory. "
>>
>> Am I correct in assuming that SparkSql is yet to reach that level of
>> efficiency, at least in aggregation operations?
>>
>> Thanks.
>>
>> [1] -
>> https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
>> [2]
>> https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
>>
>> Pramod