Hi!
I am seeing some unexpected behavior with regard to cache() on DataFrames.
Here goes:
In my Scala application, I have created a DataFrame that I run multiple
operations on. It is expensive to recompute the DataFrame, so I have called
cache() after it gets created.
I notice that cache() works as expected for some operations (e.g. count,
filter). However, when I call withColumn() on the cached DataFrame, the
DataFrame gets recomputed.
Is this the expected behavior? Is there a workaround for this?
Thanks,
Nick
P.S. Here is an example program to highlight this:
```
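// Assumes a Spark 1.x spark-shell session, where sc and sqlContext are predefined
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._
// Example UDFs that print a message each time they are called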
val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }
// Initial dataset
val df1 = sc.parallelize(Seq(("a", 1))).toDF("id", "value")
// Add column by applying twice udf
val df2 = df1.withColumn("twice", twice($"value"))
df2.cache()
df2.count() // prints "Computed: twice(1)"
// Add column by applying triple udf
val df3 = df2.withColumn("triple", triple($"value"))
df3.cache()
df3.count() // prints "Computed: twice(1)" again, then "Computed: triple(1)"
```