Hi! I am seeing some unexpected behavior with cache() on DataFrames. Here goes:
In my Scala application, I have created a DataFrame that I run multiple operations on. It is expensive to recompute the DataFrame, so I call cache() after it is created. I notice that the cache works as expected for some operations (e.g. count, filter), but when I run withColumn() on the cached DataFrame, it gets recomputed from scratch. Is this the expected behavior? Is there a workaround?

Thanks,
Nick

P.S. Here is an example program that highlights this:

```
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// Example UDFs that print whenever they are actually evaluated
val twice  = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }

// Initial dataset
val df1 = sc.parallelize(Seq(("a", 1))).toDF("id", "value")

// Add a column by applying the twice UDF, then cache and materialize
val df2 = df1.withColumn("twice", twice($"value"))
df2.cache()
df2.count() // prints Computed: twice(1)

// Add a column by applying the triple UDF
val df3 = df2.withColumn("triple", triple($"value"))
df3.cache()
df3.count() // prints Computed: twice(1) and Computed: triple(1),
            // i.e. the already-cached twice column is recomputed
```
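P.P.S. One workaround I'm considering, sketched below under the assumption that all derived columns are known up front (I haven't verified this beyond the toy example), is to add every column before caching, so that all later actions share a single cached plan. explain() can then be used to check whether the in-memory relation is actually being scanned:

```
// Sketch of a possible workaround (assumes all derived columns are
// known before caching): build the full DataFrame, then cache once.
val dfAll = df1
  .withColumn("twice", twice($"value"))
  .withColumn("triple", triple($"value"))
dfAll.cache()

dfAll.count() // first action: prints Computed: twice(1) and Computed: triple(1)
dfAll.count() // second action: should print nothing if the cache is hit

// explain() prints the physical plan; an in-memory table scan node
// (exact operator name varies by Spark version) means the cache is used
dfAll.explain()
```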