I was able to work around this by converting the DataFrame to an RDD and then back to a DataFrame. This seems very weird to me, so any insight would be much appreciated!
Thanks,
Nick

P.S. Here's the updated code with the workaround:

```
// Assumes spark-shell on Spark 1.x, where sc and sqlContext are in scope
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// Example UDFs that println when called
val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }

// Initial dataset
val df1 = sc.parallelize(Seq(("a", 1))).toDF("id", "value")

// Add a column by applying the twice UDF, then round-trip through an RDD
val df2 = {
  val tmp = df1.withColumn("twice", twice($"value"))
  sqlContext.createDataFrame(tmp.rdd, tmp.schema)
}
df2.cache()
df2.count() // prints Computed: twice(1)

// Add a column by applying the triple UDF
val df3 = df2.withColumn("triple", triple($"value"))
df3.cache()
df3.count() // prints only Computed: triple(1); twice is not recomputed
```
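In case it helps anyone else hitting this, the round-trip can be pulled out into a small helper. To be clear, this is just my own sketch, not a Spark API: `breakLineage` is a name I made up, and all it does is rebuild the DataFrame from its RDD and schema so that later operations start from a fresh logical plan instead of re-deriving the cached columns:

```
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of Spark): rebuilds the DataFrame from its
// RDD and schema, discarding the original logical plan.
def breakLineage(df: DataFrame): DataFrame =
  df.sqlContext.createDataFrame(df.rdd, df.schema)

// Usage: equivalent to the df2 block above
// val df2 = breakLineage(df1.withColumn("twice", twice($"value")))
```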