I heard Spark SQL is lazy: whenever a result table is referred to, Spark recalculates it :(
For example:

    WITH
      tab0 AS ( -- some complicated SQL that generates a table
                -- with a size of gigabytes or terabytes
      ),
      tab1 AS ( -- uses tab0
      ),
      tab2 AS ( -- uses tab0
      ),
      ...
      tabn AS ( -- uses tab0
      )
    SELECT * FROM tab1
      JOIN tab2 ON ...
      ...
      JOIN tabn ON ...

Here Spark could recalculate tab0 N times. To avoid this, it should be possible to materialize tab0 as a temp table. I found two solutions (rough sketches of both follow below):

1) Save tab0 to Parquet, then load it back into a temp view:
https://community.hortonworks.com/articles/21303/write-read-parquet-file-in-spark.html
(see also "How does createOrReplaceTempView work in Spark?")

2) Make tab0 persistent:
https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence

Which one is better in terms of query speed?
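For solution 1, here is a minimal Scala sketch of what I have in mind. The path /tmp/tab0.parquet and the spark.range(...) stand-in query are placeholders, not my real job; in the real case tab0 would come from the complicated SQL above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("tab0-parquet").getOrCreate()

    // Stand-in for the expensive query that generates tab0.
    val tab0 = spark.range(1000000L).selectExpr("id", "id % 10 AS key")

    // Run the expensive query once and write the result to Parquet.
    tab0.write.mode("overwrite").parquet("/tmp/tab0.parquet")

    // Read the materialized data back and register it as a temp view,
    // so later queries scan Parquet instead of recomputing tab0.
    spark.read.parquet("/tmp/tab0.parquet").createOrReplaceTempView("tab0")

    // tab1..tabn would now be written against the "tab0" view.
    spark.sql("SELECT key, count(*) AS cnt FROM tab0 GROUP BY key").show()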
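For solution 2, a sketch using persist(). StorageLevel.MEMORY_AND_DISK is just my guess at a sensible level; for a terabyte-scale table, DISK_ONLY might be the more realistic choice:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("tab0-persist").getOrCreate()

    // Same stand-in for the expensive query that generates tab0.
    val tab0 = spark.range(1000000L).selectExpr("id", "id % 10 AS key")

    // Cache tab0; MEMORY_AND_DISK spills to disk if it doesn't fit in memory.
    tab0.persist(StorageLevel.MEMORY_AND_DISK)
    tab0.createOrReplaceTempView("tab0")

    // SQL that references the "tab0" view should reuse the cached data
    // instead of recomputing the query each time.
    spark.sql("SELECT key, count(*) AS cnt FROM tab0 GROUP BY key").show()

    // Release the cache when the downstream joins are done.
    tab0.unpersist()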