Hi Chin Wei,

Yes, since you force the cache to be materialized by executing df.count, Spark reads data from the cache for the following job:

val res = sqlContext.sql("table1 union table2 union table3")
res.collect()
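To make that check concrete, here is a minimal sketch against the Spark 1.6-era API (the path and table name are hypothetical, not from the thread). In 1.6, a query over a materialized cache shows an InMemoryColumnarTableScan operator in the physical plan, while an uncached query shows a scan over the Parquet files:

```scala
// Sketch, Spark 1.6 API. "/data/events.parquet" and "t1" are placeholder
// names for illustration only.
val df = sqlContext.read.parquet("/data/events.parquet")
df.registerTempTable("t1")

// Before caching: explain(true) prints a physical plan that scans Parquet.
sqlContext.sql("SELECT * FROM t1").explain(true)

// Force the in-memory cache to be built.
df.persist()
df.count()

// After caching: the same query's plan now contains
// InMemoryColumnarTableScan, confirming data comes from the cache.
sqlContext.sql("SELECT * FROM t1").explain(true)
```

Running this snippet requires a live SQLContext; it cannot run standalone.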
If you insert res.explain, you can confirm which source is used to read the data, the cache or the Parquet files:

val res = sqlContext.sql("table1 union table2 union table3")
res.explain(true)
res.collect()

Am I misunderstanding something?

Best Regards,
Kazuaki Ishizaki

From: Chin Wei Low <lowchin...@gmail.com>
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: user@spark.apache.org
Date: 2016/10/07 20:06
Subject: Re: Spark SQL is slower when DataFrame is cache in Memory

Hi Ishizaki san,

So there is a gap between res.collect and when I see this log:

spark.SparkContext: Starting job: collect at <console>:26

Do you mean that during this time Spark has already started to get data from the cache? Shouldn't it only get the data after the job has started and tasks have been distributed?

Regards,
Chin Wei

On Fri, Oct 7, 2016 at 3:43 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:

Hi,

I think that the result looks correct. The current Spark spends extra time getting data from a cache, for two reasons: one is the complicated code path used to fetch the data, and the other is decompression in the case of a primitive-type column. The new implementation (https://github.com/apache/spark/pull/15219) is ready for review. It would achieve a 1.2x performance improvement for a compressed column and a much larger improvement for an uncompressed column.

Best Regards,
Kazuaki Ishizaki

From: Chin Wei Low <lowchin...@gmail.com>
To: user@spark.apache.org
Date: 2016/10/07 13:05
Subject: Spark SQL is slower when DataFrame is cache in Memory

Hi,

I am using Spark 1.6.0. I have a Spark application that creates and caches (in memory) 50+ DataFrames (some backed by a single Parquet file, some by a folder containing a few Parquet files) with the following code:

val df = sqlContext.read.parquet
df.persist
df.count

I union them into 3 DataFrames and register those as temp tables.
Then I run the following code:

val res = sqlContext.sql("table1 union table2 union table3")
res.collect()

The res.collect() is slower when I cache the DataFrames than when I do not, e.g. 3 seconds vs 1 second.

I turned on DEBUG logging and see there is a gap between res.collect() and the start of the Spark job. Is the extra time taken by query planning and optimization? The gap does not appear when I do not cache the DataFrames. Is there anything I am missing here?

Regards,
Chin Wei
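One way to narrow down where the extra time goes, sketched below under the thread's setup (the union query is a placeholder for the real one), is to time the collect() call directly. With DEBUG logging enabled, the interval between the timed call starting and the "Starting job: collect" log line is driver-side work: analysis, optimization, physical planning, and, for cached relations, setting up the in-memory scan; everything after that line is job execution.

```scala
// Sketch: measure end-to-end collect() latency so it can be compared
// against the "Starting job: collect" timestamp in the DEBUG log.
// The query string is a placeholder, not the thread's actual query.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body                              // run the query
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(s"$label took $elapsedMs ms")
  result
}

val res = sqlContext.sql(
  "SELECT * FROM table1 UNION ALL " +
  "SELECT * FROM table2 UNION ALL " +
  "SELECT * FROM table3")

// Compare this wall-clock time with and without df.persist/df.count
// to separate planning overhead from execution time.
time("collect")(res.collect())
```

This needs a live SQLContext with the temp tables registered; it is a measurement aid, not a fix.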