IIUC we are moving away from having two classes for Java and Scala, like JavaRDD and RDD. A single class is much simpler to maintain and use.
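For example, a rough sketch of the contrast using the existing classes (the object and method names below are made up just for illustration):

  import org.apache.spark.api.java.JavaRDD
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Dataset

  object Example {
    // Two-class world: Scala code has to convert before handing an RDD to
    // Java-facing code, and the two classes are maintained in parallel.
    def toJavaSide(rdd: RDD[String]): JavaRDD[String] = rdd.toJavaRDD()

    // Single-class world: Dataset is the same class for Scala and Java,
    // so there is nothing to convert or duplicate.
    def passThrough(ds: Dataset[String]): Dataset[String] = ds
  }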
I don't have a strong preference over option 3 or 4. We may need to collect more data points from actual users.

On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Scala users are arguably more prevalent than Java users, yes. Using the Java instances on the Scala side is legitimate, and they are already being used in multiple places. I don't believe Scala users find this un-Scala-friendly, as it's legitimate and already being used. I personally find it more troublesome to make Java users search for which APIs to call. Yes, I understand the pros and cons - we should also find the balance considering the actual usage.
>
> One more argument from me, though: I think one of the goals of the Spark APIs is a unified API set, to my knowledge, e.g., JavaRDD <> RDD vs. DataFrame. If neither way is particularly preferred over the other, I would just choose the one that keeps the API set unified.
>
> On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>
>> I agree that general guidance is good so we keep the APIs consistent. I don't necessarily agree that 4 is the best solution, though. I agree it's nice to have one API, but it is less friendly for the Scala side. Searching for the equivalent Java API shouldn't be hard, as the name should be very close, and if we make it a general rule users should understand it. I guess one good question is: which API do most of our users use, Java or Scala, and what is the ratio? I don't know the answer to that. I've seen more people using Scala than Java. If the majority use Scala, then I think the API should be more friendly to that.
>>
>> Tom
>>
>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I would like to discuss Java-specific APIs and which design we will choose. This has been discussed in multiple places so far, for example, at https://github.com/apache/spark/pull/28085#discussion_r407334754
>>
>> *The problem:*
>>
>> In short, I would like us to have clear guidance on how we support Java-specific APIs when they need to return a Java instance. The problem is simple:
>>
>>   def requests: Map[String, ExecutorResourceRequest] = ...
>>   def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>>
>> vs
>>
>>   def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>>
>> *Current codebase:*
>>
>> My understanding so far was that the latter is preferred, and is more consistent and prevalent in the existing codebase; for example, see StateOperatorProgress and StreamingQueryProgress in Structured Streaming. However, I realised that we also have other approaches in the current codebase. There appear to be four approaches to dealing with Java specifics in general:
>>
>> 1. Java-specific classes such as JavaRDD and JavaSparkContext.
>> 2. Java-specific methods with the same name that overload the parameters; see functions.scala.
>> 3. Java-specific methods with a different name that return a different type, such as TaskContext.resourcesJMap vs. TaskContext.resources.
>> 4. One method that returns a Java instance for both the Scala and Java sides; see StateOperatorProgress and StreamingQueryProgress.
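>>
>> For illustration only, here is roughly what 2., 3. and 4. can look like on a single made-up class ("Widget" and "Request" below are hypothetical names, not real APIs; only the TaskContext methods above are real):
>>
>>   import scala.collection.JavaConverters._
>>
>>   class Request
>>
>>   class Widget(reqs: Map[String, Request]) {
>>
>>     // 2. Same method name, with a Java-friendly overload on the parameter
>>     //    type (valid because the erased parameter types differ).
>>     def withRequests(r: Map[String, Request]): Widget = new Widget(r)
>>     def withRequests(r: java.util.Map[String, Request]): Widget =
>>       new Widget(r.asScala.toMap)
>>
>>     // 3. A separate Java-specific method with a different name, in the
>>     //    spirit of TaskContext.resources vs. TaskContext.resourcesJMap.
>>     def requests: Map[String, Request] = reqs
>>     def requestsJMap: java.util.Map[String, Request] = reqs.asJava
>>
>>     // 4. A single method that returns the Java type for both languages
>>     //    (commented out here only because it clashes with 3. above);
>>     //    Scala callers then do widget.requests.asScala when needed.
>>     // def requests: java.util.Map[String, Request] = reqs.asJava
>>   }
>>
>> (3. and 4. obviously cannot coexist for the same method, which is exactly the choice in question.)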
>>
>> *Analysis on the current codebase:*
>>
>> I agree with approach 2., because the corresponding cases give you consistent API usage across the other language APIs in general. Approach 1. is from the old world, before we had unified APIs; this might be the worst approach.
>>
>> 3. and 4. are controversial.
>>
>> For 3., if you have to use the Java APIs, you have to check every time whether there is a Java-specific variant of the API. But yes, it gives you Java/Scala-friendly instances.
>>
>> For 4., having one API that returns a Java instance lets you use it on both the Scala and Java sides, although it makes you call asScala specifically on the Scala side. But you don't have to check whether there's a variant of the API, and it gives you consistent API usage across languages.
>>
>> Also, note that calling Java from Scala is legitimate, but the opposite is not, to the best of my knowledge. In addition, you need a method that returns a Java instance anyway for PySpark or SparkR to support.
>>
>> *Proposal:*
>>
>> I would like to have general guidance on this that the Spark devs agree upon: do approach 4. If not possible, do 3. Avoid 1. at almost all cost.
>>
>> Note that this isn't a hard requirement but *a general guidance*; therefore, the decision might be up to the specific context. For example, when there are strong arguments for a separate Java-specific API, that's fine. Of course, we won't change the existing methods, given Michael's rubric added before. I am talking about new methods in unreleased branches.
>>
>> Any concern or opinion on this?
>