With complex types it doesn't work as well, but for primitive types the biggest benefit of whole-stage codegen is that we no longer need to put the intermediate data into rows or columns at all. The intermediate values are just local variables in the generated code, which the JIT can typically keep in CPU registers.
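To make that concrete, here is a rough sketch in plain Java (not actual Spark output; the class and method names are made up for illustration). The first version passes a boxed row between a hypothetical project and filter step, the way an expression that calls back into ordinary methods would have to. The second is what a fused stage can look like once the expression code itself is generated inline: the intermediate value is only ever a local `int`.

```java
// Illustrative only: FusedVsRows, project, filter are hypothetical names,
// not Spark classes. Both methods compute sum(x + 1) for rows where x + 1 is even.
final class FusedVsRows {

    // Row-based step: the intermediate result is materialized as a boxed row.
    static Object[] project(Object[] row) {
        return new Object[] { (Integer) row[0] + 1 };
    }

    // Row-based step: reads the value back out of the row (unboxing it).
    static boolean filter(Object[] row) {
        return (Integer) row[0] % 2 == 0;
    }

    // Each record is wrapped in a row, and every operator reads/writes rows.
    static long rowBased(int[] input) {
        long sum = 0;
        for (int v : input) {
            Object[] row = new Object[] { v };   // box the input into a row
            Object[] out = project(row);         // allocate another row
            if (filter(out)) {
                sum += (Integer) out[0];         // unbox again to aggregate
            }
        }
        return sum;
    }

    // Fused version: the same logic with the expressions inlined.
    // The intermediate value never leaves a primitive local variable.
    static long fused(int[] input) {
        long sum = 0;
        for (int v : input) {
            int projected = v + 1;
            if (projected % 2 == 0) {
                sum += projected;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5};
        System.out.println(rowBased(data)); // prints 12
        System.out.println(fused(data));    // prints 12
    }
}
```

Both give the same answer; the difference is the per-record allocation, boxing and casts in the first version. Whether the JIT removes that overhead on its own depends on inlining and escape analysis, so I wouldn't count on it.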
On Fri, Feb 10, 2017 at 8:22 PM, Koert Kuipers <ko...@tresata.com> wrote:

> so i have been looking for a while now at all the catalyst expressions,
> and all the relative complex codegen going on.
>
> so first off i get the benefit of codegen to turn a bunch of chained
> iterators transformations into a single codegen stage for spark. that makes
> sense to me, because it avoids a bunch of overhead.
>
> but what i am not so sure about is what the benefit is of converting the
> actual stuff that happens inside the iterator transformations into codegen.
>
> say if we have an expression that has 2 children and creates a struct for
> them. why would this be faster in codegen by re-creating the code to do
> this in a string (which is complex and error prone) compared to simply have
> the codegen call the normal method for this in my class?
>
> i see so much trivial code be re-created in codegen. stuff like this:
>
>   private[this] def castToDateCode(
>       from: DataType,
>       ctx: CodegenContext): CastFunction = from match {
>     case StringType =>
>       val intOpt = ctx.freshName("intOpt")
>       (c, evPrim, evNull) => s"""
>         scala.Option<Integer> $intOpt =
>           org.apache.spark.sql.catalyst.util.DateTimeUtils.
>             stringToDate($c);
>         if ($intOpt.isDefined()) {
>           $evPrim = ((Integer) $intOpt.get()).intValue();
>         } else {
>           $evNull = true;
>         }
>        """
>
> is this really faster than simply calling an equivalent functions from the
> codegen, and keeping the codegen logic restricted to the "unrolling" of
> chained iterators?