Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

Sean Owen Mon, 29 Jul 2024 10:55:03 -0700

Also from an ASF community perspective -

I think all are agreed this was merged too fast. But, I'm missing where
this is somehow due to the needs of a single vendor. Where is this related
to file systems or keys? Did I miss it from another discussion or PR, or is
this actually about a different issue?


Otherwise I don't see what this lecture is about. The issue that was raised
(existing Spark Session) is, I agree, not an issue.


On Mon, Jul 29, 2024 at 12:43 PM Steve Loughran <[email protected]>
wrote:

>
> I'm going to join in from an ASF community perspective.
>
> Nobody should be making fundamental changes to an ASF code base with a PR
> up and then merged two hours later because of the needs of a single vendor
> of a downstream product. This doesn't even give people in different time
> zones the chance to review it. It goes completely against the concept of
> "community" and replaces it with private problems, not shared with anyone,
> and large pieces of development work to address them without any
> opportunity for others to improve. Pieces of work which presumably must
> have been ongoing for some days.
>
> I know doing stuff in public is time-consuming as you have to spend a lot
> of time chasing reviews, but collaboration is essential as it ensures that
> changes meet the needs of a broader community than one single vendor.
> Avoiding that is exclusionary and unhealthy for a project.
>
> If the Databricks products have some problem resolving user:key secrets in
> paths in the virtual file system, that will be good to know, especially the
> what and the why, as others may encounter it too. At the very least, others
> should know what to do so as to avoid getting into the same situation.
>
> If you want more nimble development, well, closed source gives you that.
> Switching to commit-then-review on specific ASF repos is also allowed,
> despite the inherent risks. We use it for some of our Hadoop release
> packaging/testing for rapid iteration of release process automation and
> validation code.
>
> Anyway, the patch has been reverted and discussions are now ongoing, as
> they should have been from the outset.
>
> Steve
>
>
> On Wed, 24 Jul 2024 at 01:29, Hyukjin Kwon <[email protected]> wrote:
>
>> There is always a running session. I replied in the PR.
>>
>> On Tue, 23 Jul 2024 at 23:32, Dongjoon Hyun <[email protected]> wrote:
>>
>>> I'm bumping up this thread because the overhead bites us back already.
>>> Here is a commit merged 3 hours ago.
>>>
>>> https://github.com/apache/spark/pull/47453
>>> [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in
>>> spark ML reader/writer
>>>
>>> In short, unlike the original PRs' claims, this commit starts to create
>>> `SparkSession` in this layer. Although I understand why Hyukjin and
>>> Martin claim that `SparkSession` will be there anyway, this is an
>>> architectural change which we need to decide explicitly, not implicitly.
>>>
>>> > On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
>>> > We actually get the active Spark session so it doesn't cause overhead.
>>> > Also even we create, it will create once which should be pretty trivial
>>> > overhead.
>>>
>>> If this architectural change is inevitably required and needs to happen
>>> in Apache Spark 4.0.0, can we have a dev document about this? If there is
>>> no proper place, we can simply add it to the ML migration guide.
>>>
>>> Dongjoon.
>>>
