Hi, I am using Spark 1.6.0. I have a Spark application that create and cache (in memory) DataFrames (around 50+, with some on single parquet file and some on folder with a few parquet files) with the following codes:
val df = sqlContext.read.parquet df.persist df.count I union them to 3 DataFrames and register that as temp table. Then, run the following codes: val res = sqlContext.sql("table1 union table2 union table3") res.collect() The res.collect() is slower when I cache the DataFrame compare to without cache. e.g. 3 seconds vs 1 second I turn on the DEBUG log and see there is a gap from the res.collect() to start the Spark job. Is the extra time taken by the query planning & optimization? It does not show the gap when I do not cache the dataframe. Anything I am missing here? Regards, Chin Wei