Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Hyukjin Kwon
Yeah that's fine. I'll revert and open a fresh PR including my own followup when I get back home later today. On Sat, Jul 13, 2024 at 3:08 PM Holden Karau wrote: > Even if the change is reasonable (and I can see arguments both ways), it's > important that we follow the process we agreed on. Merg

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Hyukjin Kwon
We actually get the active Spark session, so it doesn't cause overhead. Also, even if we do create one, it is created only once, which should be trivial overhead. IMHO, I don't think we can deprecate the RDD API in any event. On Sat, Jul 13, 2024 at 1:30 PM Martin Grund wrote: > Mridul, I really just wanted t

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Martin Grund
Mridul, I really just wanted to understand the concern from Dongjoon. What you're pointing at is a slightly different concern. So what I see is the following: > [...] they can initialize a SparkContext and work with RDD api: The current PR uses a potentially optional value without checking that i

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Hyukjin Kwon
I made a followup (https://github.com/apache/spark/pull/47341) to address my own concerns. Please let me know if there are additional concerns. We could further discuss it there. I am also fine with reverting it and starting it from scratch if that's preferred. On Sat, 13 Jul 2024 at 11:52, Ruife

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Ruifeng Zheng
My bad, as the reviewer, I should have reviewed the PR description more closely. I think it is a good change to replace the SparkContext-based implementation with SparkSession, and if I recall correctly there were some similar attempts in MLlib in the past. On Sat, Jul 13, 2024 at 9:42 AM Hyukjin Kwon w

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Holden Karau
Even if the change is reasonable (and I can see arguments both ways), it's important that we follow the process we agreed on. Merging a PR without discussion* in ~ 2 hours from the initial proposal is not enough time to reach a lazy consensus. If it was a small bug-fix I could understand but this w

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Hyukjin Kwon
I think we should not have mentioned a specific vendor there. The change also shouldn't repartition; we should create a single partition. But in general, leveraging the Catalyst optimizer and SQL engine there is a good idea, as we can benefit from all the optimizations there. For example, it will use UTF8 encoding in

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Holden Karau
My bad I meant to say I believe the provided justification is inappropriate. On Fri, Ju

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Holden Karau
So looking at the PR it does not appear to be removing any RDD APIs but the justification provided for changing the ML backend to use the DataFrame APIs is indeed concerning. This PR appears to have been merged without proper review (or providing an opportunity for review). I’d like to remind peo

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Mridul Muralidharan
Martin, it is not necessary for users to create a SparkSession: they can initialize a SparkContext and work with the RDD API, which is what Dongjoon is referring to, IMO. Even after Spark Connect GA, I am not in favor of deprecating the RDD API, at least until we have parity between both (which we don

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Martin Grund
I took a quick look at the PR and would like to understand your concern better about: > SparkSession is heavier than SparkContext It looks like the PR is using the active SparkSession, not creating a new one etc. I would highly appreciate it if you could help me understand this situation better.

[DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-12 Thread Dongjoon Hyun
Hi, All. Apache Spark's RDD API has played an essential and invaluable role from the beginning, and it will continue to do so even if it's not supported by Spark Connect. I have a concern about recent activity that blindly replaces RDD with SparkSession. For instance, https://github.com/apache/spark/pull/47328 [

Re: [DISCUSS] Auto scaling support for structured streaming

2024-07-12 Thread Nimrod Ofek
Hi, Anyone? Scaling for different loads in a structured streaming app should be a trivial requirement for users... Thanks! Nimrod On Tue, Jul 9, 2024 at 10:20, Nimrod Ofek wrote: > PMC members, can someone please push this thing forward? > > Thanks! > Nimrod > > On Tue, Jul 9, 202

Re: [DISCUSS] Release Apache Spark 3.5.2

2024-07-12 Thread Kent Yao
Thank you everyone for the positive feedback. A special thanks to Dongjoon for offering to help. On Fri, Jul 12, 2024 at 15:12, xianjin wrote: > > +1. > Sent from my iPhone > > > On Jul 12, 2024, at 3:06 PM, L. C. Hsieh wrote: > > > > +1 > > > >> On Thu, Jul 11, 2024 at 3:22 PM Zhou Jiang wrote: > >> > >>

Re: [DISCUSS] Release Apache Spark 3.5.2

2024-07-12 Thread xianjin
+1. Sent from my iPhone > On Jul 12, 2024, at 3:06 PM, L. C. Hsieh wrote: > > +1 > >> On Thu, Jul 11, 2024 at 3:22 PM Zhou Jiang wrote: >> >> +1 for releasing 3.5.2, which would also benefit the Spark Operator >> multi-version support. >> >>> On Thu, Jul 11, 2024 at 7:56 AM Dongjoon Hyun