I'm bumping this thread because the overhead is already biting us. Here is a commit merged 3 hours ago:
https://github.com/apache/spark/pull/47453
[SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer

In short, contrary to the original PR's claims, this commit starts to create `SparkSession` in this layer. Although I understand why Hyukjin and Martin claim that a `SparkSession` will be there anyway, this is an architectural change which we need to decide on explicitly, not implicitly.

> On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
> We actually get the active Spark session so it doesn't cause overhead. Also
> even we create, it will create once which should be pretty trivial overhead.

If this architectural change is inevitable and needs to happen in Apache Spark 4.0.0, can we have a dev document about it? If there is no proper place, we can simply add it to the ML migration guide.

Dongjoon.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
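For readers following along: the quoted "trivial overhead" argument rests on the get-active-else-create-once pattern (Spark's `SparkSession.getActiveSession()` vs. `SparkSession.builder.getOrCreate()`). The sketch below is a minimal pure-Python stand-in, not Spark's actual implementation; the class and function names are hypothetical, chosen only to make the control flow visible without a PySpark dependency.

```python
# Illustrative sketch of the "get active, else create once" pattern from the
# quoted message. This is NOT Spark code; FakeSession is a stand-in so the
# control flow can be seen in isolation.

class FakeSession:
    """Stand-in for SparkSession; counts how many times one is built."""
    created = 0           # class-level counter of session constructions
    _active = None        # the currently active session, if any

    def __init__(self):
        type(self).created += 1

    @classmethod
    def get_active(cls):
        # Analogous to SparkSession.getActiveSession(): a cheap lookup,
        # never constructs a session.
        return cls._active

    @classmethod
    def get_or_create(cls):
        # Analogous to SparkSession.builder.getOrCreate(): reuse the active
        # session when present, otherwise build one and remember it.
        if cls._active is None:
            cls._active = cls()
        return cls._active


def load_model():
    """Hypothetical stand-in for an ML reader that needs a session."""
    return FakeSession.get_or_create()


# Repeated reader calls reuse the one session, so construction cost is paid
# exactly once, which is the overhead argument being debated above.
s1 = load_model()
s2 = load_model()
```

Whether that once-only construction cost is acceptable is a separate question from whether the reader/writer layer should be the component that triggers it, which is the architectural point of this email.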