Catalyst: Reusing already computed expressions within a projection

Justin Uang Sat, 30 May 2015 08:32:53 -0700

If I do the following

    df2 = df.withColumn('y', df['x'] * 7)
    df3 = df2.withColumn('z', df2.y * 3)
    df3.explain()


Then the result is

    > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59
* 7.0) AS y#64 * 3.0) AS z#65]
    >  PhysicalRDD [date#56,id#57,timestamp#58,x#59], MapPartitionsRDD[125]
at mapPartitions at SQLContext.scala:1163

Effectively I want to compute

    y = f(x)
    z = g(y)

The catalyst optimizer realizes that y#64 is the same as the one previously
computed, however, when building the projection, it is ignoring the fact
that it had already computed y, so it calculates `x * 7` twice.

    y = x * 7
    z = x * 7 * 3

If I wanted to make this fix, would it be possible to do the logic in the
optimizer phase? I imagine that it's difficult because the expressions in
InterpretedMutableProjection don't have access to the previous expression
results, only the input row, and that the design doesn't seem to be catered
for this.

Catalyst: Reusing already computed expressions within a projection

Reply via email to