I think you are looking for
http://en.wikipedia.org/wiki/Common_subexpression_elimination in the
optimizer.

One thing to note is that as we do more and more optimization like this,
the optimization time might increase. Do you see a case where this can
bring you substantial performance gains?


On Sat, May 30, 2015 at 9:02 AM, Justin Uang <justin.u...@gmail.com> wrote:

> On second thought, perhaps can this be done by writing a rule that builds
> the dag of dependencies between expressions, then convert it into several
> layers of projections, where each new layer is allowed to depend on
> expression results from previous projections?
>
> Are there any pitfalls to this approach?
>
> On Sat, May 30, 2015 at 11:30 AM Justin Uang <justin.u...@gmail.com>
> wrote:
>
>> If I do the following
>>
>>     df2 = df.withColumn('y', df['x'] * 7)
>>     df3 = df2.withColumn('z', df2.y * 3)
>>     df3.explain()
>>
>> Then the result is
>>
>>     > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS
>> y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
>>     >  PhysicalRDD [date#56,id#57,timestamp#58,x#59],
>> MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163
>>
>> Effectively I want to compute
>>
>>     y = f(x)
>>     z = g(y)
>>
>> The catalyst optimizer realizes that y#64 is the same as the one
>> previously computed, however, when building the projection, it is ignoring
>> the fact that it had already computed y, so it calculates `x * 7` twice.
>>
>>     y = x * 7
>>     z = x * 7 * 3
>>
>> If I wanted to make this fix, would it be possible to do the logic in the
>> optimizer phase? I imagine that it's difficult because the expressions in
>> InterpretedMutableProjection don't have access to the previous expression
>> results, only the input row, and that the design doesn't seem to be catered
>> for this.
>>
>>

Reply via email to