[
https://issues.apache.org/jira/browse/SPARK-47672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Holden Karau updated SPARK-47672:
---------------------------------
Fix Version/s: 4.2.0
> Avoid double evaluation of non-trivial projected elements from filter pushdown
> ------------------------------------------------------------------------------
>
> Key: SPARK-47672
> URL: https://issues.apache.org/jira/browse/SPARK-47672
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Holden Karau
> Assignee: Holden Karau
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.2.0
>
>
> Repro here [https://gist.github.com/holdenk/0f9660bcbd9e63aaff904f15d3439db1]
>
> You can work around this by setting an expensive UDF to non-deterministic but
> that's not ideal and won't fix expensive internal operations (like string
> matching).
>
> Instead when we go to bubble up a filter, if we should not move a filter up
> above a projection of what we are filtering on.
>
> https://issues.apache.org/jira/browse/SPARK-40045 partially fixed some of
> this by (roughly) ordering filter expressions by cost so that we're not
> evaluating more than ~2x (e.g. in old behavior bubbled up filter could become
> the first elem of the filter and then the cheap null checks would go away and
> we'd have expensive compute on everything not just filtered data), but we
> should "trust" the users projection + later use of that projection to
> indicate that a UDF is expensive and we should only evaluate it once inside
> of the projection and filter after.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]