Hi, I am new to Spark, and I am trying to compare Spark SQL performance against Hive. I set up a standalone box with 24 cores and 64G of memory. We have one SQL query in mind to test. Here is the basic setup on this one box for the query we are trying to run:

1) Dataset 1: 6.6G AVRO file with snappy compression, which contains a nested structure of 3 arrays of struct in AVRO
2) Dataset 2: 5G AVRO file with snappy compression
3) Dataset 3: 2.3M AVRO file with snappy compression

The basic structure of the query is like this:
(select xxx
 from dataset1
   lateral view outer explode(struct1)
   lateral view outer explode(struct2)
 where xxxxx)
left outer join
(select xxxx
 from dataset2
   lateral view explode(xxx)
 where xxxx)
on xxxx
left outer join
(select xxx from dataset3 where xxxx)
on xxxxx

So overall what it does is two outer explodes on dataset 1, a left outer join with an explode of dataset 2, and finally a left outer join with dataset 3.

On this standalone box I installed Hadoop 2.2, Hive 0.12 and Spark 1.2.0. As the baseline, the above query finishes in around 50 minutes in Hive 0.12, with 6 mappers and 3 reducers, each with a 1G max heap, across 3 rounds of MR jobs. This is a very expensive query running in our production every day, of course with a much bigger dataset.

Now I want to see how fast Spark can run the same query. I am using the following settings, based on my understanding of Spark, for a fair test between it and Hive:

export SPARK_WORKER_MEMORY=32g
export SPARK_DRIVER_MEMORY=2g
--executor-memory 9g --total-executor-cores 9

I am trying to run one executor with 9 cores and a max 9G heap, to make Spark use almost the same resources we give to MapReduce.

Here is the result without any additional configuration changes, running under Spark 1.2.0 and using HiveContext in Spark SQL to run the exact same query (a minimal sketch of how the run is driven from spark-shell is included further below). Spark SQL generated 5 stages of tasks, shown below:

Stage 4  collect at SparkPlan.scala:84        2015/02/20 10:48:46  26 s     200/200
Stage 3  mapPartitions at Exchange.scala:64   2015/02/20 10:32:07  16 min   200/200  1112.3 MB
Stage 2  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  9 min    40/40    4.7 GB   22.2 GB
Stage 1  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  1.9 min  50/50    6.2 GB   2.8 GB
Stage 0  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  6 s      2/2      2.3 MB   156.6 KB

So the wall time of the whole query is 26 s + 16 min + 9 min + 2 min + 6 s, around 28 minutes. That is about 56% of the original time, not bad, but I want to know whether any Spark tuning can make it even faster. For stages 2 and 3 I observed that GC time gets more and more expensive, especially in stage 3, shown below.

For stage 3:
Metric         Min     25th percentile  Median  75th percentile  Max
Duration       20 s    30 s             35 s    39 s             2.4 min
GC Time        9 s     17 s             20 s    25 s             2.2 min
Shuffle Write  4.7 MB  4.9 MB           5.2 MB  6.1 MB           8.3 MB

So at the median, GC took 20 s / 35 s = 57% of the task time.

The first change I made is to add the following line to spark-defaults.conf:

spark.serializer org.apache.spark.serializer.KryoSerializer

My assumption is that using KryoSerializer, instead of the default Java serialization, will lower the memory footprint and therefore the GC pressure at runtime. I know I changed the correct spark-defaults.conf, because if I add "spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" to the same file, I see the GC activity in the stdout file. Of course I did not add that in this test, as I want to make only one change at a time.

The result is almost the same as with standard Java serialization. The wall time is still 28 minutes, and in stage 3 GC still takes around 50 to 60% of the time, with almost the same min, median and max figures as before, without any noticeable performance gain.

Next, based on my understanding, I think the default spark.storage.memoryFraction is too high for this query: there is no reason to reserve so much memory for caching data, because we do not reuse any dataset within this one query.
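(For reference, each run is driven from spark-shell roughly as in the sketch below. This is only a minimal sketch: it assumes dataset1/dataset2/dataset3 are Hive tables over the AVRO files and are already registered in the metastore, and the query string stands for the HiveQL skeleton at the top of this message.)

    // spark-shell launched roughly with:
    //   bin/spark-shell --executor-memory 9g --total-executor-cores 9
    import org.apache.spark.sql.hive.HiveContext

    // sc is the SparkContext that spark-shell creates automatically
    val hiveContext = new HiveContext(sc)

    // placeholder for the HiveQL skeleton shown at the top of this message;
    // dataset1/2/3 are assumed to be Hive tables over the AVRO files
    val query = """select xxx from ( ... )
      left outer join ( ... ) on xxxx
      left outer join ( ... ) on xxxxx"""

    val result = hiveContext.sql(query)
    result.collect()   // this collect() shows up as "collect at SparkPlan.scala:84" in the UI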
So I added "--conf spark.storage.memoryFraction=0.3" at the end of the spark-shell command, as I want to reserve only half as much memory for caching data as in the first run. Of course, this time I rolled back the first change ("KryoSerializer").

The result looks almost the same. The whole query finished in around 28 s + 14 min + 9.6 min + 1.9 min + 6 s = 27 minutes.

It looks like Spark is faster than Hive, but are there any steps I can take to make it even faster? Why does using "KryoSerializer" make no difference? If I want to use the same resources as now, is there anything I can do to speed it up further, especially to lower the GC time?

Thanks

Yong