I was able to work around this by converting the DataFrame to an RDD and then
back to a DataFrame. This seems very weird to me, so any insight would be much
appreciated!
Thanks,
Nick
P.S. Here's the updated code with the workaround:
```
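// Imports (sqlContext.implicits._ is already in scope in spark-shell)
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._
// Example UDFs that println when called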
val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }
// Initial dataset
val df1 = sc.parallelize(Seq(("a", 1))).toDF("id", "value")
// Add column by applying twice udf
val df2 = {
  val tmp = df1.withColumn("twice", twice($"value"))
  // Workaround: round-trip through an RDD and rebuild the DataFrame
  sqlContext.createDataFrame(tmp.rdd, tmp.schema)
}
df2.cache()
df2.count() // prints Computed: twice(1)
// Add column by applying triple udf
val df3 = df2.withColumn("triple", triple($"value"))
df3.cache()
df3.count() // prints Computed: triple(1)
```