I think Michael's bringing up code gen because the compiler (not Spark, but
javac and the JVM JIT) already does common subexpression elimination, so we
might get it for free during code generation.
On Sun, May 31, 2015 at 11:48 AM, Justin Uang wrote:
Thanks for pointing to that link! It looks like it’s useful, but it does
look more complicated than the case I’m trying to address.
In my case, we set y = f(x), then we use y later on in future projections (z
= g(y)). In that case, the analysis is trivial in that we aren’t trying to
find equivalent expressions.
I think this is likely something that we'll want to do during the code
generation phase, though it's probably not the lowest-hanging fruit at this
point.
On Sun, May 31, 2015 at 5:02 AM, Reynold Xin wrote:
I think you are looking for
http://en.wikipedia.org/wiki/Common_subexpression_elimination in the
optimizer.
One thing to note is that as we add more and more optimizations like this,
optimization time might increase. Do you see a case where this can
bring you substantial performance gains?
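To make the CSE idea concrete, here is a hedged sketch in plain Python (a toy expression-tree model, not Spark's actual optimizer or Catalyst code; all names here are hypothetical). It counts repeated subtrees and hoists any subtree that occurs more than once into a shared temporary:

```python
# Minimal sketch of common subexpression elimination (CSE) over a toy
# expression tree. Expressions are tuples: ('col', name), or (op, *children).
# This is an illustration only, not Spark's optimizer.
from collections import defaultdict

def count_subtrees(expr, counts):
    """Count every non-leaf subtree occurrence."""
    counts[expr] += 1
    if isinstance(expr, tuple) and expr[0] not in ('col', 'lit'):
        for child in expr[1:]:
            if isinstance(child, tuple):
                count_subtrees(child, counts)

def cse(exprs):
    """Return (bindings, rewritten): repeated subtrees become ('ref', name)."""
    counts = defaultdict(int)
    for e in exprs:
        count_subtrees(e, counts)
    bindings = {}
    def rewrite(expr):
        if isinstance(expr, tuple) and expr[0] not in ('col', 'lit'):
            expr = (expr[0],) + tuple(rewrite(c) if isinstance(c, tuple) else c
                                      for c in expr[1:])
            if counts[expr] > 1:
                name = bindings.setdefault(expr, 't%d' % len(bindings))
                return ('ref', name)
        return expr
    return bindings, [rewrite(e) for e in exprs]

# (x * 7.0) appears both as y and inside z, as in the withColumn example:
y = ('*', ('col', 'x'), 7.0)
z = ('*', ('*', ('col', 'x'), 7.0), 3.0)
bindings, rewritten = cse([y, z])
# bindings maps the shared (x * 7.0) subtree to a single temporary;
# both projections then reference it instead of recomputing it.
```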
On second thought, perhaps this can be done by writing a rule that builds
the DAG of dependencies between expressions, then converts it into several
layers of projections, where each new layer is allowed to depend on
expression results from previous projections?
Are there any pitfalls to this approach?
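The layering rule described above can be sketched in plain Python (a hypothetical illustration, not actual Catalyst code): given each derived column's dependencies, repeatedly emit a projection layer containing every column whose inputs are already available.

```python
# Sketch of splitting derived columns into projection layers, where each
# layer may reference results of earlier layers. Hypothetical illustration
# of the proposed rule, not Spark code.

def layer_projections(deps, base_columns):
    """deps: {column: set of columns it reads}; base_columns: input columns.
    Returns a list of layers (lists of column names) in evaluation order."""
    available = set(base_columns)
    remaining = dict(deps)
    layers = []
    while remaining:
        # A column is ready once everything it reads is available.
        layer = [c for c, d in remaining.items() if d <= available]
        if not layer:
            raise ValueError("cyclic or unsatisfiable dependencies")
        layers.append(sorted(layer))
        available |= set(layer)
        for c in layer:
            del remaining[c]
    return layers

# y = f(x), z = g(y): z must land in a later projection than y.
layers = layer_projections({'y': {'x'}, 'z': {'y'}}, ['x'])
```

One possible pitfall this sketch surfaces: a cycle (or a reference to a missing column) leaves no column ready, which the rule would need to detect rather than loop forever.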
If I do the following
df2 = df.withColumn('y', df['x'] * 7)
df3 = df2.withColumn('z', df2.y * 3)
df3.explain()
Then the result is
> Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
> PhysicalRDD [date#56,id#57,timestamp#58,x
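A hedged toy model of why the plan above repeats (x#59 * 7.0): when the two withColumn projections are collapsed, the reference to y is replaced by y's defining expression, so the subtree is inlined twice. This is a plain-Python illustration of that substitution, not Spark's analyzer:

```python
# Toy model: inlining an alias's defining expression into a projection
# that references it. Expressions are tuples: ('col', name) or (op, *children).

def inline_alias(expr, aliases):
    """Replace column references with their defining expressions."""
    if isinstance(expr, tuple) and expr[0] == 'col':
        name = expr[1]
        return inline_alias(aliases[name], aliases) if name in aliases else expr
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(inline_alias(c, aliases) for c in expr[1:])
    return expr

# y is defined as x * 7.0; z references y, so the whole subtree is inlined.
aliases = {'y': ('*', ('col', 'x'), 7.0)}
z = inline_alias(('*', ('col', 'y'), 3.0), aliases)
# z now contains the full (x * 7.0) subtree, mirroring
# ((x#59 * 7.0) AS y#64 * 3.0) AS z#65 in the physical plan.
```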