If I do the following df2 = df.withColumn('y', df['x'] * 7) df3 = df2.withColumn('z', df2.y * 3) df3.explain()
Then the result is > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65] > PhysicalRDD [date#56,id#57,timestamp#58,x#59], MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163 Effectively I want to compute y = f(x) z = g(y) The catalyst optimizer realizes that y#64 is the same as the one previously computed, however, when building the projection, it is ignoring the fact that it had already computed y, so it calculates `x * 7` twice. y = x * 7 z = x * 7 * 3 If I wanted to make this fix, would it be possible to do the logic in the optimizer phase? I imagine that it's difficult because the expressions in InterpretedMutableProjection don't have access to the previous expression results, only the input row, and that the design doesn't seem to be catered for this.