[ 
https://issues.apache.org/jira/browse/SPARK-47672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-47672:
---------------------------------
    Fix Version/s: 4.2.0

> Avoid double evaluation of non-trivial projected elements from filter pushdown
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-47672
>                 URL: https://issues.apache.org/jira/browse/SPARK-47672
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.1
>            Reporter: Holden Karau
>            Assignee: Holden Karau
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> Repro here [https://gist.github.com/holdenk/0f9660bcbd9e63aaff904f15d3439db1] 
>  
> You can work around this by setting an expensive UDF to non-deterministic but 
> that's not ideal and won't fix expensive internal operations (like string 
> matching).
>  
> Instead when we go to bubble up a filter, if we should not move a filter up 
> above a projection of what we are filtering on.
>  
> https://issues.apache.org/jira/browse/SPARK-40045 partially fixed some of 
> this by (roughly) ordering filter expressions by cost so that we're not 
> evaluating more than ~2x (e.g. in old behavior bubbled up filter could become 
> the first elem of the filter and then the cheap null checks would go away and 
> we'd have expensive compute on everything not just filtered data), but we 
> should "trust" the users projection + later use of that projection to 
> indicate that a UDF is expensive and we should only evaluate it once inside 
> of the projection and filter after.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to