Re: DataFrame.withColumn() recomputes columns even after cache()

pnpritchard Tue, 14 Jul 2015 16:45:58 -0700

I was able to work around this by converting the DataFrame to an RDD and then
back to a DataFrame. This seems very weird to me, so any insight would be much
appreciated!


Thanks,
Nick


P.S. Here's the updated code with the workaround:
```
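    // Assumes a spark-shell session, where `sc` and `sqlContext` are predefined
    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._ // for toDF and the $"..." column syntax

    // Example UDFs that println when called
    val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
    val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }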

    // Initial dataset
    val df1 = sc.parallelize(Seq(("a", 1))).toDF("id", "value")

    // Add column by applying twice udf
    val df2 = {
      val tmp = df1.withColumn("twice", twice($"value"))
      sqlContext.createDataFrame(tmp.rdd, tmp.schema)
    }
    df2.cache()
    df2.count() // prints Computed: twice(1)

    // Add column by applying triple udf
    val df3 = df2.withColumn("triple", triple($"value"))
    df3.cache()
    df3.count() // prints Computed: triple(1)
```
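
If the round trip is needed in more than one place, it could be pulled out into a small helper. This is only a sketch: `roundTrip` is a hypothetical name, not part of Spark, and it assumes the same spark-shell session as the code above (so `df1`, `twice`, and the `$"..."` syntax are already in scope). It does nothing beyond the `createDataFrame(tmp.rdd, tmp.schema)` call from the workaround, and it does not explain why the recomputation happens in the first place.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not a Spark API): rebuild a DataFrame from its own
// RDD[Row] and schema, exactly as the workaround above does inline.
def roundTrip(df: DataFrame): DataFrame =
  df.sqlContext.createDataFrame(df.rdd, df.schema)

// Usage, reusing `df1` and `twice` from the code above
val df2b = roundTrip(df1.withColumn("twice", twice($"value")))
df2b.cache()
df2b.count()
```

Keeping the conversion in one function makes it easy to remove again if a later Spark release no longer needs it.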



