Also from an ASF community perspective: I think everyone agrees this was merged too fast. But I'm missing how this is due to the needs of a single vendor. Where is this related to file systems or keys? Did I miss it in another discussion or PR, or is this actually about a different issue?
Otherwise I don't see what this lecture is about. The issue that was raised (an existing Spark session) is, I agree, not an issue.

On Mon, Jul 29, 2024 at 12:43 PM Steve Loughran <ste...@cloudera.com.invalid> wrote:
>
> I'm going to join in from an ASF community perspective.
>
> Nobody should be making fundamental changes to an ASF code base with a PR
> up and then merged two hours later because of the needs of a single vendor
> of a downstream product. This doesn't even give people in different time
> zones the chance to review it. It goes completely against the concept of
> "community" and replaces it with private problems, not shared with anyone,
> and large pieces of development work to address them without any
> opportunity for others to improve. Pieces of work which presumably must
> have been ongoing for some days.
>
> I know doing stuff in public is time-consuming as you have to spend a lot
> of time chasing reviews, but collaboration is essential as it ensures that
> changes meet the needs of a broader community than one single vendor.
> Avoiding that is exclusionary and unhealthy for a project.
>
> If the Databricks products have some problem resolving user:key secrets in
> paths in the virtual file system, that will be good to know, especially the
> what and the why, as others may encounter it too. At the very least, others
> should know what to do so as to avoid getting into the same situation.
>
> If you want more nimble development, well, closed source gives you that.
> Switching to commit-then-review on specific ASF repos is also allowed,
> despite the inherent risks. We use it for some of the Hadoop release
> packaging/testing for rapid iteration of release process automation and
> validation code.
>
> Anyway, the patch has been reverted and discussions are now ongoing, as
> they should have been from the outset.
>
> Steve
>
>
> On Wed, 24 Jul 2024 at 01:29, Hyukjin Kwon <gurwls...@apache.org> wrote:
>
>> There is always a running session. I replied in the PR.
>>
>> On Tue, 23 Jul 2024 at 23:32, Dongjoon Hyun <dongj...@apache.org> wrote:
>>
>>> I'm bumping up this thread because the overhead bites us back already.
>>> Here is a commit merged 3 hours ago.
>>>
>>> https://github.com/apache/spark/pull/47453
>>> [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in
>>> spark ML reader/writer
>>>
>>> In short, unlike the original PRs' claims, this commit starts to create
>>> `SparkSession` in this layer. Although I understand why Hyukjin and
>>> Martin claim that `SparkSession` will be there anyway, this is an
>>> architectural change which we need to decide explicitly, not implicitly.
>>>
>>> > On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
>>> > We actually get the active Spark session so it doesn't cause overhead.
>>> > Also, even if we create one, it is created only once, which should be
>>> > pretty trivial overhead.
>>>
>>> If this architectural change is unavoidable and needs to happen in
>>> Apache Spark 4.0.0, can we have a dev document about it? If there is
>>> no proper place, we can simply add it to the ML migration guide.
>>>
>>> Dongjoon.