Let me clarify. Here is the TL;DR:

- The PR (https://github.com/apache/spark/pull/47328) mentioned Databricks
  in the PR description (it has since been edited), which should have been
  avoided.
- I also updated the committer guidelines in spark-website to prevent such
  cases in the future.
- Otherwise, the change was reverted, and merged back after proper reviews.
- The changes are legitimate and benefit all users; they are not specific
  to a single vendor.
On Mon, 29 Jul 2024 at 15:13, Sean Owen <sro...@gmail.com> wrote:

> Also from an ASF community perspective -
>
> I think all are agreed this was merged too fast. But I'm missing where
> this is somehow due to the needs of a single vendor. Where is this
> related to file systems or keys? Did I miss it from another discussion
> or PR, or is this actually about a different issue?
>
> Otherwise I don't see what this lecture is about. The issue that was
> raised (existing Spark session) is, I agree, not an issue.
>
>
> On Mon, Jul 29, 2024 at 12:43 PM Steve Loughran
> <ste...@cloudera.com.invalid> wrote:
>
>>
>> I'm going to join in from an ASF community perspective.
>>
>> Nobody should be making fundamental changes to an ASF code base with
>> a PR put up and then merged two hours later because of the needs of a
>> single vendor of a downstream product. This doesn't even give people
>> in different time zones the chance to review it. It goes completely
>> against the concept of "community" and replaces it with private
>> problems, not shared with anyone, and large pieces of development
>> work to address them without any opportunity for others to improve
>> them. Pieces of work which presumably must have been ongoing for
>> some days.
>>
>> I know doing stuff in public is time-consuming, as you have to spend
>> a lot of time chasing reviews, but collaboration is essential as it
>> ensures that changes meet the needs of a broader community than one
>> single vendor. Avoiding that is exclusionary and unhealthy for a
>> project.
>>
>> If the Databricks products have some problem resolving user:key
>> secrets in paths in the virtual file system, that will be good to
>> know, especially the what and the why - as others may encounter it
>> too. At the very least, others should know what to do to avoid
>> getting into the same situation.
>>
>> If you want more nimble development, well, closed source gives you
>> that. Switching to commit-then-review on specific ASF repos is also
>> allowed, despite the inherent risks. We use it for some of the
>> Hadoop release packaging/testing for rapid iteration of release
>> process automation and validation code.
>>
>> Anyway, the patch has been reverted and discussions are now ongoing,
>> as they should have been from the outset.
>>
>> Steve
>>
>>
>> On Wed, 24 Jul 2024 at 01:29, Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> There is always a running session. I replied in the PR.
>>>
>>> On Tue, 23 Jul 2024 at 23:32, Dongjoon Hyun <dongj...@apache.org> wrote:
>>>
>>>> I'm bumping up this thread because the overhead is already biting
>>>> us back. Here is a commit merged 3 hours ago.
>>>>
>>>> https://github.com/apache/spark/pull/47453
>>>> [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession
>>>> in spark ML reader/writer
>>>>
>>>> In short, unlike the original PRs' claims, this commit starts to
>>>> create `SparkSession` in this layer. Although I understand why
>>>> Hyukjin and Martin claim that a `SparkSession` will be there
>>>> anyway, this is an architectural change which we need to decide
>>>> explicitly, not implicitly.
>>>>
>>>> > On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
>>>> > We actually get the active Spark session, so it doesn't cause
>>>> > overhead. Also, even if we create one, it will be created only
>>>> > once, which should be pretty trivial overhead.
>>>>
>>>> If this architectural change is inevitably required and needs to
>>>> happen in Apache Spark 4.0.0, can we have a dev document about this?
>>>> If there is no proper place, we can simply add it to the ML
>>>> migration guide.
>>>>
>>>> Dongjoon.
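
For context, here is a minimal, illustrative sketch of the two patterns
under discussion - relying on an already-active session versus creating
one through the builder. It uses only the public PySpark `SparkSession`
API and is not the actual code from the PRs above.

from pyspark.sql import SparkSession

# Pattern 1: rely on a session that some caller already started.
# getActiveSession() returns None when no session is active on the
# current thread, so this layer does not create anything itself.
active = SparkSession.getActiveSession()
if active is None:
    raise RuntimeError("no active SparkSession; create one first")

# Pattern 2: create (or reuse) a session in this layer.
# getOrCreate() returns the existing session when one exists, so the
# extra cost is at most a one-time session creation.
spark = SparkSession.builder.getOrCreate()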