We actually get the active SparkSession, so it doesn't add overhead. And even if we do have to create one, it is created only once, which should be pretty trivial overhead.
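For illustration, a minimal Scala sketch of the lookup-then-create pattern described above (the helper name `sessionFor` is hypothetical; `getActiveSession` and `getOrCreate()` are public Spark API):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SparkSession

    // Hypothetical helper: reuse the active session if one exists;
    // otherwise fall back to getOrCreate(), which builds the session at
    // most once and reuses the existing SparkContext under the hood.
    def sessionFor(sc: SparkContext): SparkSession =
      SparkSession.getActiveSession.getOrElse(
        SparkSession.builder().config(sc.getConf).getOrCreate()
      )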
I don't think we can deprecate the RDD API, IMHO, in any event.

On Sat, Jul 13, 2024 at 1:30 PM Martin Grund <mar...@databricks.com.invalid> wrote:

> Mridul, I really just wanted to understand the concern from Dongjoon.
> What you're pointing at is a slightly different concern. What I see is
> the following:
>
> > [...] they can initialize a SparkContext and work with RDD api:
>
> The current PR uses a potentially optional value without checking that
> it is set (which is what would happen if you have only a SparkContext
> and no SparkSession).
>
> I understand that this can happen when someone creates a Spark job and
> uses no other Spark APIs to begin with. But in the context of the
> current Spark ML implementation, is it actually possible to end up in
> this situation? I'm really just trying to understand the system's
> invariants.
>
> > [...] SparkSession is heavier than SparkContext
>
> Assuming that, for whatever reason, a SparkSession was created, is
> there a downside to using it?
>
> Please see my questions as independent of the RDD API discussion
> itself; I don't think this PR was even meant to be put in the context
> of any Spark Connect work.
>
> On Fri, Jul 12, 2024 at 11:58 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>
>> It is not necessary for users to create a SparkSession, Martin - they
>> can initialize a SparkContext and work with the RDD API, which is what
>> Dongjoon is referring to, IMO.
>>
>> Even after Spark Connect goes GA, I am not in favor of deprecating the
>> RDD API, at least until we have parity between the two (which we don't
>> have today) and have vetted that parity over the course of a few minor
>> releases.
>>
>> Regards,
>> Mridul
>>
>> On Fri, Jul 12, 2024 at 4:19 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> Apache Spark's RDD API has played an essential and invaluable role
>>> from the beginning, and it will continue to do so even if it's not
>>> supported by Spark Connect.
>>>
>>> I have a concern about recent activity that blindly replaces RDD
>>> usage with SparkSession.
>>>
>>> For instance:
>>>
>>> https://github.com/apache/spark/pull/47328
>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>> Dataframe read / write API
>>>
>>> This PR doesn't look right to me for two reasons:
>>> - SparkSession is heavier than SparkContext.
>>> - According to the PR description below, the motivation is also
>>> hidden from the community.
>>>
>>> > # Why are the changes needed?
>>> > In Databricks Runtime, the RDD read / write API has issues with
>>> > certain storage types that require the account key, but the
>>> > DataFrame read / write API works.
>>>
>>> In addition, we don't know whether this PR actually fixes the
>>> mentioned storage issue, because it isn't testable within the
>>> community's test coverage.
>>>
>>> I'm wondering whether the Apache Spark community aims to move away
>>> from RDD usage in favor of `Spark Connect`. Isn't it too early, given
>>> that `Spark Connect` is not even GA in the community?
>>>
>>> Dongjoon.
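For reference, a minimal Scala sketch contrasting the two I/O paths debated above (the paths and sample data are made up for illustration; the actual PR changes ML / R model persistence internals, not user-facing code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("io-sketch").getOrCreate()
    val sc = spark.sparkContext

    // RDD read / write API -- the path SPARK-48883 replaces.
    sc.parallelize(Seq("a", "b"), numSlices = 1).saveAsTextFile("/tmp/out-rdd")
    val viaRdd = sc.textFile("/tmp/out-rdd").collect()

    // DataFrame read / write API -- the replacement.
    import spark.implicits._
    Seq("a", "b").toDF("value").coalesce(1).write.text("/tmp/out-df")
    val viaDf = spark.read.text("/tmp/out-df").collect()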