Hi!

I am seeing some unexpected behavior with cache() on DataFrames.
Here goes:

In my Scala application, I create a DataFrame that I run multiple operations
on. The DataFrame is expensive to recompute, so I call cache() right after it
is created.

I notice that the cache works as expected for some operations (e.g. count and
filter). However, when I derive a new DataFrame with withColumn() and run an
action on it, the original DataFrame gets recomputed instead of being read
from the cache.

Is this the expected behavior? Is there a workaround for this?

Thanks,
Nick


P.S. Here is an example program to highlight this:
```
    // Assumes a SQLContext named sqlContext is in scope (as in spark-shell)
    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._ // for toDF and the $"..." column syntax

    // Example UDFs that println when called
    val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
    val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }

    // Initial dataset
    val df1 = sc.parallelize(Seq(("a", 1))).toDF("id", "value")

    // Add a column by applying the twice udf, then cache and materialize
    val df2 = df1.withColumn("twice", twice($"value"))
    df2.cache()
    df2.count() // prints: Computed: twice(1)

    // Add another column by applying the triple udf
    val df3 = df2.withColumn("triple", triple($"value"))
    df3.cache()
    df3.count() // prints: Computed: twice(1) and Computed: triple(1),
                // i.e. the cached "twice" column is recomputed
```
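
For completeness, here are a couple of extra checks that should make the
difference visible. The plan node name below is my assumption about what the
explain() output looks like for a cached read, so take it with a grain of
salt:
```
    // For comparison: operations on the cached df2 itself do not re-run the
    // udf; nothing extra is printed because the values come from the cache.
    df2.filter($"value" > 0).count()

    // My assumption: if df3 were reusing the cached df2, the physical plan
    // printed by explain() would contain an InMemoryColumnarTableScan node.
    df3.explain(true)
```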




