Hi, I think the result you see is expected. Current Spark spends extra time getting data from the cache, for two reasons. One is the complicated code path for retrieving the data; the other is decompression in the case of primitive-type columns. A new implementation (https://github.com/apache/spark/pull/15219) is ready for review. It should achieve about a 1.2x performance improvement for a compressed column, and a much larger improvement for an uncompressed column.
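A minimal sketch for observing this overhead yourself, assuming the Spark 1.6-era SQLContext API and a hypothetical parquet path, could look like this; it times the same scan before and after caching in the in-memory columnar format:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object CacheScanTiming {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CacheScanTiming"))
        val sqlContext = new SQLContext(sc)

        // Hypothetical input path; substitute your own parquet data.
        val df = sqlContext.read.parquet("/path/to/data.parquet")

        // Small helper to time an action and print elapsed milliseconds.
        def time[T](label: String)(f: => T): T = {
          val start = System.nanoTime()
          val result = f
          println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
          result
        }

        // Scan directly from parquet (no in-memory cache).
        time("uncached scan")(df.count())

        // Cache in the in-memory columnar format and materialize it;
        // the second scan reads from the cache, which for primitive-type
        // columns includes the decompression step described above.
        df.persist()
        df.count()
        time("cached scan")(df.count())

        sc.stop()
      }
    }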
Best Regards,
Kazuaki Ishizaki

From: Chin Wei Low <lowchin...@gmail.com>
To: user@spark.apache.org
Date: 2016/10/07 13:05
Subject: Spark SQL is slower when DataFrame is cache in Memory

Hi,

I am using Spark 1.6.0. I have a Spark application that creates and caches (in memory) around 50+ DataFrames (some on a single parquet file, some on a folder with a few parquet files) with the following code:

    val df = sqlContext.read.parquet
    df.persist
    df.count

I union them into 3 DataFrames and register those as temp tables. Then I run the following:

    val res = sqlContext.sql("table1 union table2 union table3")
    res.collect()

The res.collect() is slower when I cache the DataFrames compared to when I do not, e.g. 3 seconds vs 1 second. I turned on the DEBUG log and see there is a gap from the res.collect() call to the start of the Spark job. Is the extra time taken by query planning & optimization? The gap does not show up when I do not cache the DataFrames. Anything I am missing here?

Regards,
Chin Wei
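For anyone trying to reproduce the reported gap end to end, a minimal sketch of the setup described above might look like the following. The paths and table names are hypothetical, and explicit SELECTs are used because a bare "table1 union table2" is not valid Spark SQL:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object UnionCachedTables {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("UnionCachedTables"))
        val sqlContext = new SQLContext(sc)

        // Hypothetical parquet inputs standing in for the 50+ DataFrames.
        val paths = Seq("/data/part1.parquet", "/data/part2.parquet", "/data/part3.parquet")
        paths.zipWithIndex.foreach { case (path, i) =>
          val df = sqlContext.read.parquet(path)
          df.persist()
          df.count() // materialize the in-memory columnar cache
          df.registerTempTable(s"table${i + 1}")
        }

        // Union the cached tables with explicit SELECTs.
        val res = sqlContext.sql(
          "SELECT * FROM table1 UNION ALL SELECT * FROM table2 UNION ALL SELECT * FROM table3")

        // Time the collect; comparing this with and without persist()
        // above should show the gap reported in the question.
        val start = System.nanoTime()
        res.collect()
        println(s"collect took ${(System.nanoTime() - start) / 1e6} ms")

        sc.stop()
      }
    }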