I think you are looking for http://en.wikipedia.org/wiki/Common_subexpression_elimination in the optimizer.
One thing to note is that as we do more and more optimization like this, the optimization time might increase. Do you see a case where this can bring you substantial performance gains? On Sat, May 30, 2015 at 9:02 AM, Justin Uang <justin.u...@gmail.com> wrote: > On second thought, perhaps can this be done by writing a rule that builds > the dag of dependencies between expressions, then convert it into several > layers of projections, where each new layer is allowed to depend on > expression results from previous projections? > > Are there any pitfalls to this approach? > > On Sat, May 30, 2015 at 11:30 AM Justin Uang <justin.u...@gmail.com> > wrote: > >> If I do the following >> >> df2 = df.withColumn('y', df['x'] * 7) >> df3 = df2.withColumn('z', df2.y * 3) >> df3.explain() >> >> Then the result is >> >> > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS >> y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65] >> > PhysicalRDD [date#56,id#57,timestamp#58,x#59], >> MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163 >> >> Effectively I want to compute >> >> y = f(x) >> z = g(y) >> >> The catalyst optimizer realizes that y#64 is the same as the one >> previously computed, however, when building the projection, it is ignoring >> the fact that it had already computed y, so it calculates `x * 7` twice. >> >> y = x * 7 >> z = x * 7 * 3 >> >> If I wanted to make this fix, would it be possible to do the logic in the >> optimizer phase? I imagine that it's difficult because the expressions in >> InterpretedMutableProjection don't have access to the previous expression >> results, only the input row, and that the design doesn't seem to be catered >> for this. >> >>