Hi, in Spark 2.4.0, unpersisting a DataFrame only un-caches the given Dataset and re-compiles the dependent cached queries after removing the cached query, as described in https://issues.apache.org/jira/browse/SPARK-21478.
When all the jobs are done, we unpersist the cached data. It can take a long time to rebuild cached data that will never be used again. Take the following code for example:

  val x1 = Seq(1).toDF()
  x1.persist()
  val x2 = x1.select($"value" * 2)
  x2.persist()
  val x3 = x2.select($"value" * 2)
  x3.persist()
  x1.count()
  x2.count()
  x3.count()
  ...
  x1.unpersist() // never used again, but re-compiles dependent cached queries: x2, x3
  x2.unpersist() // never used again, but re-compiles dependent cached queries: x3
  x3.unpersist() // never used again

So, can we expose the *cascade* parameter in the unpersist method and let the user choose whether to rebuild or not? Today unpersist hard-codes cascade = false:

  def unpersist(blocking: Boolean): this.type = {
    sparkSession.sharedState.cacheManager.uncacheQuery(this, cascade = false, blocking)
    this
  }
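For illustration, here is a minimal sketch of what exposing the flag could look like. The two-argument unpersist overload below is an assumption for discussion, not an existing Dataset API; it just forwards a user-supplied cascade flag to the CacheManager while keeping today's behavior as the default:

  // Hypothetical overload (not in the current API): let the caller decide
  // whether dependent cached queries are re-compiled (cascade = false)
  // or simply dropped along with this one (cascade = true).
  def unpersist(blocking: Boolean, cascade: Boolean): this.type = {
    sparkSession.sharedState.cacheManager.uncacheQuery(this, cascade, blocking)
    this
  }

  // Preserve the existing single-argument behavior for compatibility.
  def unpersist(blocking: Boolean): this.type = unpersist(blocking, cascade = false)

With something like that, the example above could call x1.unpersist(blocking = false, cascade = true), so the caches for x2 and x3 are dropped instead of rebuilt when they will never be read again.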