I'm not very sure.
I can imagine that there are cases where you might not wish to
increase the number of expressions (see example 1) and you might not
wish to push down expensive computations (see examples 2 and 3).
Example 1. Given
Project(x + y, x - y, x * y, x / y, Filter(x < y, R))
it is not beneficial to push down to
Filter(x < y, Project(x, y, x + y, x - y, x * y, x / y, R))
because the filter's original input [x, y] has fewer fields than the
pushed-down projection [x, y, x + y, x - y, x * y, x / y].
Example 2. Given
Project(expensiveFunction(w, x, y, z), Filter(x < y, R))
it is probably not beneficial to push down to
Filter(x < y, Project(x, y, expensiveFunction(w, x, y, z), R))
because you'd rather copy a few extra fields [w, x, y, z] than burn a
lot of CPU on more rows.
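
To make the trade-off in example 2 concrete, here is a rough sketch in plain Java (not Calcite code). The class, the Row type and the counting stand-in for expensiveFunction are all hypothetical; it only counts how many times the function runs when the filter sits below the project versus above it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

public class PushDownCostSketch {
    /** Hypothetical row with the four input fields used in example 2. */
    public static final class Row {
        final int w, x, y, z;
        public Row(int w, int x, int y, int z) {
            this.w = w; this.x = x; this.y = y; this.z = z;
        }
    }

    static final AtomicLong CALLS = new AtomicLong();

    /** Stand-in for expensiveFunction(w, x, y, z); it only counts invocations. */
    static long expensiveFunction(Row r) {
        CALLS.incrementAndGet();
        return (long) r.w + r.x + r.y + r.z;
    }

    /** Project(expensiveFunction(...), Filter(x < y, R)): filter runs first. */
    public static long callsWithFilterBelow(List<Row> rows) {
        CALLS.set(0);
        rows.stream()
            .filter(r -> r.x < r.y)
            .map(PushDownCostSketch::expensiveFunction)
            .collect(Collectors.toList());
        return CALLS.get();
    }

    /** Filter(x < y, Project(x, y, expensiveFunction(...), R)): project runs first. */
    public static long callsWithProjectBelow(List<Row> rows) {
        CALLS.set(0);
        rows.stream()
            .map(r -> new long[] {r.x, r.y, expensiveFunction(r)})
            .filter(a -> a[0] < a[1])
            .collect(Collectors.toList());
        return CALLS.get();
    }

    public static void main(String[] args) {
        // Two of these four rows satisfy x < y.
        List<Row> rows = Arrays.asList(
            new Row(1, 1, 2, 3), new Row(1, 5, 2, 3),
            new Row(1, 0, 9, 3), new Row(1, 7, 7, 3));
        System.out.println("filter below project: " + callsWithFilterBelow(rows));
        System.out.println("project below filter: " + callsWithProjectBelow(rows));
    }
}
```

With the project below the filter, the function runs once for every input row; with the filter below, it runs only for the rows that survive. The more selective the filter, the bigger the gap.
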
Example 3. Given
Project(expensiveFunction(w, x, y, z), Union(all, R1, R2))
it is beneficial to push down to
Union(all,
Project(expensiveFunction(w, x, y, z), R1),
Project(expensiveFunction(w, x, y, z), R2))
because expensiveFunction has to be invoked once per row whatever you do.
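
The same kind of count can be sketched for example 3, again in plain Java with hypothetical names rather than anything from Calcite. Rows are modeled as int arrays [w, x, y, z], and the stand-in function just counts its invocations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionPushDownSketch {
    static long calls = 0;

    /** Stand-in for expensiveFunction(w, x, y, z); it only counts invocations. */
    static long expensiveFunction(int[] row) {
        calls++;
        return (long) row[0] + row[1] + row[2] + row[3];
    }

    /** Applies the function to every row, like a Project over its input. */
    static List<Long> project(List<int[]> rows) {
        List<Long> out = new ArrayList<>();
        for (int[] row : rows) {
            out.add(expensiveFunction(row));
        }
        return out;
    }

    /** Concatenates both inputs, keeping duplicates, like Union(all, ...). */
    static <T> List<T> unionAll(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    /** Project(expensiveFunction(...), Union(all, R1, R2)) */
    public static long callsProjectAboveUnion(List<int[]> r1, List<int[]> r2) {
        calls = 0;
        project(unionAll(r1, r2));
        return calls;
    }

    /** Union(all, Project(..., R1), Project(..., R2)) */
    public static long callsProjectBelowUnion(List<int[]> r1, List<int[]> r2) {
        calls = 0;
        unionAll(project(r1), project(r2));
        return calls;
    }

    public static void main(String[] args) {
        List<int[]> r1 = Arrays.asList(new int[] {1, 2, 3, 4}, new int[] {5, 6, 7, 8});
        List<int[]> r2 = Arrays.asList(new int[] {9, 10, 11, 12});
        System.out.println("project above union: " + callsProjectAboveUnion(r1, r2));
        System.out.println("project below union: " + callsProjectBelowUnion(r1, r2));
    }
}
```

Both plans invoke the function once per input row, so the push-down costs nothing extra in CPU.
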
But other than that, pushing makes sense. In particular, if the
expressions are just field references, trimming almost always helps
(unless you are going to pay a price for rewriting the data).
And Minji Kim made some changes recently that made it safer to push
expressions through outer joins.
So I think we should be bullish. Can you log a JIRA case, make the
change in a PR, and see if anything breaks in the test suite? Then we
can ask stakeholders from Hive, Drill etc. to review and confirm that
it doesn't break anything on their side.
Julian
On Wed, May 10, 2017 at 7:58 PM, Gian Merlino <[email protected]> wrote:
> PushProjector has some code like this in "convertProject":
>
> if ((projRefs.cardinality() == nChildFields)
>     && (childPreserveExprs.size() == 0)) {
>   return null;
> }
>
> Which, if I am following right, means when there aren't any preserve-exprs,
> we will only push a project past a filter if the project is going to cut
> down on the number of fields that go through the filter.
>
> What's the rationale for that? In one case I'm looking at, a query like:
>
> SELECT dim1, SUM(cnt) / COUNT(*) GROUP BY dim1 HAVING SUM(cnt) / COUNT(*)
> = 1
>
> The plan ends up involving doing an aggregate -> filter on
> SUM(cnt)/COUNT(*) = 1 -> project [dim1, SUM(cnt)/COUNT(*)]. Wouldn't it be
> better to consider the possibility of doing this instead: aggregate ->
> project [dim1, SUM(cnt)/COUNT(*)] -> filter on $1 = 1? Then the division
> doesn't have to be done twice.
>
> My hidden motivation for asking: Druid's Calcite rules currently can't
> handle the former case but they probably would be able to handle the latter
> :)
>
> Gian