[ https://issues.apache.org/jira/browse/SPARK-50992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ángel Álvarez Pascua updated SPARK-50992: ----------------------------------------- Description: When AQE is enabled, Spark triggers update events to the internal listener bus whenever a plan changes. These events include a plain-text description of the plan, which is computationally expensive to generate for large plans. {*}Key Issues:{*}{*}{{*}} *1. High Cost of Plan String Calculation:* * Generating the string description for large physical plans is a costly operation. * This impacts performance, particularly in complex workflows with frequent plan updates (e.g. persisting DataFrames). *2. Out-of-Memory (OOM) Errors:* * Events are stored in the listener bus as {{SQLExecutionUIData}} objects and retained until a threshold is reached. * This retention behavior can lead to memory exhaustion when processing large plans, causing OOM errors. *Current Workarounds Are Ineffective:* * *Reducing Retained Executions* ({{{}spark.sql.ui.retainedExecutions{}}}): Even when set to {{1}} or {{{}0{}}}, events are still created, requiring plan string calculations. * *Limiting Plan String Length* ({{{}spark.sql.maxPlanStringLength{}}}): Reducing the maximum string length (e.g., to {{{}1,000,000{}}}) may mitigate OOMs but does not eliminate the overhead of string generation. * *Available Explain Modes:* All existing explain modes are verbose and computationally expensive, failing to resolve these issues. *Proposed Solution:* Introduce a new explain mode, {*}{{off}}{*}, which suppresses the generation of plan string descriptions. * When this mode is enabled, Spark skips the calculation of plan descriptions altogether. * This resolves OOM errors and restores performance parity with non-AQE execution. *Impact of Proposed Solution:* * Eliminates OOMs in large plans with AQE enabled. * Reduces the performance overhead associated with plan string generation. * Ensures Spark scales better in environments with large, complex plans. *Reproducibility:* A test reproducing the issue has been attached. was: When AQE is enabled, Spark triggers update events to the internal listener bus whenever a plan changes. These events include a plain-text description of the plan, which is computationally expensive to generate for large plans. {*}Key Issues:{*}{*}{*} *1. High Cost of Plan String Calculation:* * Generating the string description for large physical plans is a costly operation. * This impacts performance, particularly in complex workflows with frequent plan updates (e.g. persisting DataFrames). *2. Out-of-Memory (OOM) Errors:* * Events are stored in the listener bus as {{SQLExecutionUIData}} objects and retained until a threshold is reached. * This retention behavior can lead to memory exhaustion when processing large plans, causing OOM errors. *Current Workarounds Are Ineffective:* * *Reducing Retained Executions* ({{{}spark.sql.ui.retainedExecutions{}}}): Even when set to {{1}} or {{{}0{}}}, events are still created, requiring plan string calculations. * *Limiting Plan String Length* ({{{}spark.sql.maxPlanStringLength{}}}): Reducing the maximum string length (e.g., to {{{}1,000,000{}}}) may mitigate OOMs but does not eliminate the overhead of string generation. * *Available Explain Modes:* All existing explain modes are verbose and computationally expensive, failing to resolve these issues. * *Proposed Solution:* Introduce a new explain mode, {*}{{off}}{*}, which suppresses the generation of plan string descriptions. * When this mode is enabled, Spark skips the calculation of plan descriptions altogether. * This resolves OOM errors and restores performance parity with non-AQE execution. *Impact of Proposed Solution:* * Eliminates OOMs in large plans with AQE enabled. * Reduces the performance overhead associated with plan string generation. * Ensures Spark scales better in environments with large, complex plans. *Reproducibility:* The following test replicates the issue has been attached. > OOMs and performance issues with AQE in large plans > --------------------------------------------------- > > Key: SPARK-50992 > URL: https://issues.apache.org/jira/browse/SPARK-50992 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 4.0.0, 3.5.3, 3.5.4 > Reporter: Ángel Álvarez Pascua > Priority: Major > Attachments: Main.scala > > > When AQE is enabled, Spark triggers update events to the internal listener > bus whenever a plan changes. These events include a plain-text description of > the plan, which is computationally expensive to generate for large plans. > {*}Key Issues:{*}{*}{{*}} > *1. High Cost of Plan String Calculation:* > * Generating the string description for large physical plans is a costly > operation. > * This impacts performance, particularly in complex workflows with frequent > plan updates (e.g. persisting DataFrames). > *2. Out-of-Memory (OOM) Errors:* > * Events are stored in the listener bus as {{SQLExecutionUIData}} objects > and retained until a threshold is reached. > * This retention behavior can lead to memory exhaustion when processing > large plans, causing OOM errors. > > *Current Workarounds Are Ineffective:* > * *Reducing Retained Executions* ({{{}spark.sql.ui.retainedExecutions{}}}): > Even when set to {{1}} or {{{}0{}}}, events are still created, requiring plan > string calculations. > * *Limiting Plan String Length* ({{{}spark.sql.maxPlanStringLength{}}}): > Reducing the maximum string length (e.g., to {{{}1,000,000{}}}) may mitigate > OOMs but does not eliminate the overhead of string generation. > * *Available Explain Modes:* All existing explain modes are verbose and > computationally expensive, failing to resolve these issues. > > *Proposed Solution:* > Introduce a new explain mode, {*}{{off}}{*}, which suppresses the generation > of plan string descriptions. > * When this mode is enabled, Spark skips the calculation of plan > descriptions altogether. > * This resolves OOM errors and restores performance parity with non-AQE > execution. > > *Impact of Proposed Solution:* > * Eliminates OOMs in large plans with AQE enabled. > * Reduces the performance overhead associated with plan string generation. > * Ensures Spark scales better in environments with large, complex plans. > > *Reproducibility:* > A test reproducing the issue has been attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org