I'm bumping this thread because the overhead is already biting us. Here is a commit merged 3 hours ago:
https://github.com/apache/spark/pull/47453
[SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer

In short, contrary to the original PR's claims, this commit starts to create `SparkSession` in this layer. Although I understand why Hyukjin and Martin claim that a `SparkSession` will be there anyway, this is an architectural change which we need to decide on explicitly, not implicitly.

> On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
> We actually get the active Spark session so it doesn't cause overhead. Also
> even we create, it will create once which should be pretty trivial overhead.

If this architectural change is inevitable and needs to happen in Apache Spark 4.0.0, can we have a dev document about it? If there is no proper place, we can simply add it to the ML migration guide.

Dongjoon.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
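For readers following along: the quoted "trivial overhead" argument rests on the get-active-else-create-once pattern (Spark's `SparkSession.getActiveSession()` vs. `SparkSession.builder.getOrCreate()`). The sketch below is a minimal pure-Python stand-in, not Spark's actual implementation; the class and function names are hypothetical, chosen only to make the control flow visible without a PySpark dependency.

```python
# Illustrative sketch of the "get active, else create once" pattern from the
# quoted message. This is NOT Spark code; FakeSession is a stand-in so the
# control flow can be seen in isolation.

class FakeSession:
    """Stand-in for SparkSession; counts how many times one is built."""
    created = 0           # class-level counter of session constructions
    _active = None        # the currently active session, if any

    def __init__(self):
        type(self).created += 1

    @classmethod
    def get_active(cls):
        # Analogous to SparkSession.getActiveSession(): a cheap lookup,
        # never constructs a session.
        return cls._active

    @classmethod
    def get_or_create(cls):
        # Analogous to SparkSession.builder.getOrCreate(): reuse the active
        # session when present, otherwise build one and remember it.
        if cls._active is None:
            cls._active = cls()
        return cls._active


def load_model():
    """Hypothetical stand-in for an ML reader that needs a session."""
    return FakeSession.get_or_create()


# Repeated reader calls reuse the one session, so construction cost is paid
# exactly once, which is the overhead argument being debated above.
s1 = load_model()
s2 = load_model()
```

Whether that once-only construction cost is acceptable is a separate question from whether the reader/writer layer should be the component that triggers it, which is the architectural point of this email.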