There is always a running session. I replied in the PR.

On Tue, 23 Jul 2024 at 23:32, Dongjoon Hyun <dongj...@apache.org> wrote:
> I'm bumping up this thread because the overhead bites us back already.
> Here is a commit merged 3 hours ago.
>
> https://github.com/apache/spark/pull/47453
> [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in
> spark ML reader/writer
>
> In short, unlike the original PRs' claims, this commit starts to create
> `SparkSession` in this layer. Although I understand why Hyukjin and
> Martin claim that a `SparkSession` will be there anyway, this is an
> architectural change which we need to decide explicitly, not implicitly.
>
> On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
> > We actually get the active Spark session, so it doesn't cause overhead.
> > Also, even if we create one, it will be created once, which should be
> > pretty trivial overhead.
>
> If this architectural change is inevitably required and needs to happen in
> Apache Spark 4.0.0, can we have a dev document about it? If there is no
> proper place, we can simply add it to the ML migration guide.
>
> Dongjoon.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
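
For readers following along, here is a minimal sketch (not the actual pyspark.ml code; the helper name `_get_session` is hypothetical) contrasting the two approaches discussed in this thread: reusing the already-active session versus falling back to creating one at the reader/writer layer.

```python
from pyspark.sql import SparkSession


def _get_session() -> SparkSession:
    # Approach assumed by the original PRs: reuse the session that is
    # already active on the current thread, if any.
    active = SparkSession.getActiveSession()
    if active is not None:
        return active
    # Fallback as described for SPARK-48970: create (or reuse) a session
    # at this layer. getOrCreate() returns the existing default session
    # when one exists, so the creation cost is paid at most once.
    return SparkSession.builder.getOrCreate()


if __name__ == "__main__":
    spark = _get_session()
    print(spark.version)
```

The discussion above is about whether this fallback (creating a session inside the ML reader/writer layer) should be treated as an explicit architectural decision rather than an incidental one.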