I've just opened the ticket SPARK-50992
<https://issues.apache.org/jira/browse/SPARK-50992>
Could someone please review it? I'm proposing a new explain mode for
converting plans to strings. Currently, explaining plans with AQE enabled
is resource-intensive and can lead to memory accumulation in the heap. To
address this, I suggest introducing a new "off" mode that completely
disables plan string generation. Unless I'm mistaken, this mode should
ideally be the default configuration, replacing the current verbose
"formatted" mode.

On Thu, Jan 23, 2025 at 7:12 AM, Ángel (<angel.alvarez.pas...@gmail.com>)
wrote:

> Hi,
>
> I’m working on a performance issue that ends up throwing an
> OutOfMemoryError when AQE is enabled. This problem was first identified by
> Russel Jurney while running GraphFrames unit tests, as detailed in his
> gist <https://gist.github.com/rjurney/6abeffbd59c67df5e5243c8f6619b6bf>.
> The issue was also discussed in a related Spark mailing list thread
> <https://lists.apache.org/thread/kl50ryobwqlr93s6zwkhjp9rjsqkpwk0>. The
> problem really has nothing to do with GraphFrames specifically; instead, it
> arises from how Spark internally generates a massive physical plan and
> converts it into a String (the same plan, several times) during execution
> with cached DataFrames and AQE enabled.
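>
> To make the failure mode concrete, here is a rough sketch of the kind of
> iterative workload that triggers it (not the actual GraphFrames code; the
> shape of the loop is illustrative only):
>
>     // Each iteration stacks more operators on top of a cached DataFrame,
>     // so the physical plan keeps growing; with AQE on, Spark converts
>     // that ever-larger plan to a String over and over, and the strings
>     // accumulate on the heap.
>     var df = spark.range(1000).toDF("id")
>     for (_ <- 1 to 50) {
>       df = df.union(df).distinct().cache()
>       df.count() // materialize; AQE re-plans and re-stringifies
>     }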
>
> While I haven’t opened a Jira issue yet, I plan to do so shortly. Given
> its potential to affect many use cases, I believe it would be beneficial to
> address this issue in time for the Spark 4.0 release.
>
>
> Regards,
>
> Ángel
>
>
>
>
>
> On Wed, Jan 22, 2025 at 11:17 PM, Mich Talebzadeh (<
> mich.talebza...@gmail.com>) wrote:
>
>> Interesting points: client-server architecture has been around since the
>> days of Sybase. A client written in any language, say Python or Scala,
>> makes a request to the Spark cluster. This remote access model inherently
>> creates a level of isolation between the client application and the
>> internal workings of the Spark cluster, which brings certain benefits.
>> However, this client-server architecture introduces challenges when
>> integrating features like SQL Scripting, which might require deeper
>> integration with the Spark engine's internal mechanisms. Is that assertion
>> correct, and if so, what in a nutshell are the challenges you referred to?
>>
>> HTH,
>>
>> Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>    view my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Wed, 22 Jan 2025 at 20:47, David Milicevic
>> <david.milice...@databricks.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> Together with my team, I'm working on adding support for SQL Scripting (
>>> JIRA <https://issues.apache.org/jira/browse/SPARK-48338>, Ref Spec
>>> <https://docs.google.com/document/d/1uFv2VoqDoOH2k6HfgdkBYxp-Ou7Qi721eexNW6IYWwk/edit?pli=1&tab=t.0#heading=h.4cz970y1mk93>
>>> ).
>>> The feature is guarded by the `spark.sql.scripting.enabled` SQL conf
>>> because it's still in development, but some features are already available
>>> in OSS Spark: it's possible to execute scripts with all of the regular
>>> statements, as well as with the newly added control-flow statements
>>> (IF/ELSE, CASE, WHILE, REPEAT, LOOP, FOR, LEAVE, ITERATE, etc.).
>>> SQL Scripting still doesn't work with Spark Connect.
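>>>
>>> For a flavor of what already works, here's a small sketch of a script
>>> with control flow (a sketch only, following the ref spec; the exact
>>> syntax and result semantics may evolve while the feature is behind the
>>> flag):
>>>
>>>     // Assumes a SparkSession `spark` with the scripting conf enabled.
>>>     spark.conf.set("spark.sql.scripting.enabled", "true")
>>>     spark.sql("""
>>>       BEGIN
>>>         DECLARE i INT DEFAULT 0;
>>>         WHILE i < 3 DO
>>>           SET i = i + 1;
>>>         END WHILE;
>>>         SELECT i;  -- last result-producing statement: returns 3
>>>       END
>>>     """).show()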
>>>
>>> Thanks,
>>> David
>>>
>>> On Wed, Jan 22, 2025 at 12:25 PM Stefan Kandic
>>> <stefan.kan...@databricks.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am working on adding collation support (
>>>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-46830).
>>>>
>>>> Right now, collations are enabled by default, as we have finished almost
>>>> everything we planned to add. However, there are still some smaller
>>>> things and improvements left with ongoing efforts (e.g., setting a
>>>> default collation on a table/view/schema).
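>>>>
>>>> For anyone following along, a quick sketch of what this enables (the
>>>> collation names below, UTF8_BINARY as the case-sensitive default and
>>>> UTF8_LCASE for case-insensitive comparison, are among those added under
>>>> SPARK-46830):
>>>>
>>>>     // Default (UTF8_BINARY) comparison is case-sensitive ...
>>>>     spark.sql("SELECT 'Spark' = 'SPARK' AS eq").show()  // false
>>>>     // ... while UTF8_LCASE compares case-insensitively.
>>>>     spark.sql(
>>>>       "SELECT 'Spark' COLLATE UTF8_LCASE = 'SPARK' AS eq"
>>>>     ).show()  // true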
>>>>
>>>> Regards,
>>>> Stefan
>>>>
>>>> On 2025/01/15 13:41:07 Wenchen Fan wrote:
>>>> > Hi all,
>>>> >
>>>> > We have cut the "branch-4.0" and I'm sending this email to collect the
>>>> > information for ongoing projects targeting Spark 4.0. Please reply to
>>>> > this email to share the project progress with the community.
>>>> >
>>>> > Note that the scheduled code freeze date is Feb 1, and the RC1 cut date
>>>> > is Feb 15.
>>>> >
>>>> > Thanks,
>>>> > Wenchen
>>>> >
>>>>
>>>
