Hi,

I am using Spark 1.6.0. I have a Spark application that create and cache
(in memory) DataFrames (around 50+, with some on single parquet file and
some on folder with a few parquet files) with the following codes:

val df = sqlContext.read.parquet
df.persist
df.count

I union them to 3 DataFrames and register that as temp table.

Then, run the following codes:
val res = sqlContext.sql("table1 union table2 union table3")
res.collect()

The res.collect() is slower when I cache the DataFrame compare to without
cache. e.g. 3 seconds vs 1 second

I turn on the DEBUG log and see there is a gap from the res.collect() to
start the Spark job.

Is the extra time taken by the query planning & optimization? It does not
show the gap when I do not cache the dataframe.

Anything I am missing here?

Regards,
Chin Wei

Reply via email to