[ https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Asif updated SPARK-45959:
-------------------------
    Target Version/s: 4.1  (was: 3.5.1)
   Affects Version/s: 4.1

> SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf degradation
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-45959
>                 URL: https://issues.apache.org/jira/browse/SPARK-45959
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.1, 4.1
>            Reporter: Asif
>            Priority: Major
>              Labels: pull-request-available
>
> Although the documentation clearly recommends adding all columns in a single
> shot, in practice it is difficult to expect customers to modify their code.
> In Spark 2 the analyzer rules did not perform deep tree traversals, and in
> Spark 3 the plans are additionally cloned before being handed to the
> analyzer, optimizer, etc., which was not the case in Spark 2. Together these
> changes have increased query times from 5 minutes to 2-3 hours.
> Often the columns are added to the plan via for-loop logic that keeps
> appending new computations based on some rule, as in the sketch below.
> My suggestion is therefore to collapse the Projects early: once analysis of
> the logical plan is done, but before the plan gets assigned to the field
> variable in QueryExecution.
> The PR for the above is ready for review. The major change is in the way the
> lookup is performed in CacheManager; I have described the logic in the PR
> and added multiple tests.
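> To make the pattern concrete, here is a minimal, self-contained sketch (not
> the PR itself) contrasting the per-column loop with the single-shot
> alternative. It assumes Spark 3.3+ for Dataset.withColumns; the column
> names, loop bound, and local master are arbitrary illustration choices.
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> object WithColumnLoopDemo {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("withColumn-loop-demo")
>       .master("local[*]")
>       .getOrCreate()
>     val base = spark.range(10).toDF("id")
>
>     // Anti-pattern: every withColumn call wraps the plan in another Project
>     // node, so N iterations hand the analyzer a tree N Projects deep, and
>     // each iteration eagerly re-analyzes the whole accumulated tree.
>     var looped = base
>     for (i <- 0 until 200) {
>       looped = looped.withColumn(s"c$i", col("id") + i)
>     }
>
>     // Recommended: build all the expressions first and add them in one
>     // shot, which yields a single Project on top of the scan.
>     val newCols = (0 until 200).map(i => s"c$i" -> (col("id") + i)).toMap
>     val single = base.withColumns(newCols)
>
>     // The proposal amounts to doing what the optimizer's CollapseProject
>     // rule does, but right after analysis (internal catalyst API, invoked
>     // here only to illustrate the idea).
>     val collapsed = org.apache.spark.sql.catalyst.optimizer.CollapseProject(
>       looped.queryExecution.analyzed)
>
>     // Same result either way, but drastically different plan depths.
>     println(looped.queryExecution.analyzed.treeString.linesIterator.length)
>     println(single.queryExecution.analyzed.treeString.linesIterator.length)
>     println(collapsed.treeString.linesIterator.length)
>
>     spark.stop()
>   }
> }
> {code}
> On the looped plan the analyzed tree is hundreds of Project nodes deep,
> while the single-shot and collapsed plans stay flat; that depth difference
> is what drives the degradation described above.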