> One thing we could do here is use Java collections internally and make the Scala API a thin wrapper around Java -- like how Python works.
> Then adding a method to the Scala API would require adding it to the Java API and we would keep the two more in sync.
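For illustration, a minimal sketch of what that thin-wrapper layering could look like, assuming a made-up resource API (the class and method names below are hypothetical, not existing Spark classes): the Java-facing class owns the java.util collections, and the Scala facade only converts at the boundary.

    import java.{util => ju}
    import scala.collection.JavaConverters._

    // Hypothetical internal/Java-facing API that works purely with Java collections.
    class JavaResourceProfile {
      private val reqs = new ju.HashMap[String, String]()

      def addRequest(name: String, amount: String): Unit = reqs.put(name, amount)

      def requests(): ju.Map[String, String] = reqs
    }

    // Hypothetical Scala facade: every method delegates to the Java API, so a new
    // Scala method can only exist once the Java side exposes it too.
    class ResourceProfile(private val underlying: JavaResourceProfile) {
      def requests: Map[String, String] = underlying.requests().asScala.toMap
    }

Under such a layering, a Scala-only method cannot be added without a Java counterpart, which is the synchronization benefit described in the quoted suggestion.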
I think that could be an appropriate approach if we had to deal with this case a lot, but I don't think there are many user-facing APIs that return Java collections; it's rather rare. Also, there are relatively fewer Java users than Scala users. This case is slightly different from Python, where PySpark has many more differences to deal with. Also, in the case of `Seq`, we can simply use `Array` instead on both the Scala and Java sides. I don't find such cases notably awkward. The problematic cases are likely specific to a few Java collections or instances, and I would like to avoid overkill here. Of course, if there is a place where it makes sense to consider other options, let's do so. I don't mean to say this is the only acceptable option.

On Tue, Apr 28, 2020 at 1:18 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> I think the right choice here depends on how the object is used. For developer and internal APIs, I think standardizing on Java collections makes the most sense.
>
> For user-facing APIs, it is awkward to return Java collections to Scala code -- I think that's the motivation for Tom's comment. For user APIs, I think most methods should return Scala collections, and I don't have a strong opinion about whether the conversion (or lack thereof) is done in a separate object (#1) or in parallel methods (#3).
>
> Both #1 and #3 seem like about the same amount of work and have the same likelihood that a developer will leave out a Java method version. One thing we could do here is use Java collections internally and make the Scala API a thin wrapper around Java -- like how Python works. Then adding a method to the Scala API would require adding it to the Java API and we would keep the two more in sync. It would also help avoid Scala collections leaking into internals.
>
> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Let's go with the option that takes less maintenance effort then, rather than leaving it undecided and delaying while this inconsistency remains.
>>
>> I don't think we will get very meaningful data about this soon, given that we haven't heard many complaints about this in general so far.
>>
>> The point of this thread is to make a call rather than defer it to the future.
>>
>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan, <cloud0...@gmail.com> wrote:
>>
>>> IIUC we are moving away from having two classes for Java and Scala, like JavaRDD and RDD. It's much simpler to maintain and use a single class.
>>>
>>> I don't have a strong preference between options 3 and 4. We may need to collect more data points from actual users.
>>>
>>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>>> Scala users are arguably more prevalent than Java users, yes. Using Java instances on the Scala side is legitimate, and they are already used in multiple places. I don't believe Scala users find this un-Scala-friendly, as it is legitimate and already in use. I personally find it more troublesome to make Java users search for which APIs to call. Yes, I understand the pros and cons -- we should also find the balance considering the actual usage.
>>>>
>>>> One more argument from me, though, is that one of the goals of the Spark APIs is a unified API set, to my knowledge, e.g., JavaRDD <> RDD vs DataFrame. If neither way is particularly preferred over the other, I would just choose the one that keeps the API set unified.
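As a concrete illustration of that last point, a single method returning a java.util.Map (in the style of StateOperatorProgress) is usable from both languages, and the cost on the Scala side is one asScala call at the call site. The class and method names below are made up for the sketch:

    import java.{util => ju}
    import scala.collection.JavaConverters._

    // Hypothetical API exposing one Java-friendly method shared by both languages.
    class TaskInfoApi {
      def properties(): ju.Map[String, String] = {
        val m = new ju.HashMap[String, String]()
        m.put("spark.executor.cores", "4")
        m
      }
    }

    object TaskInfoApiUsage {
      def main(args: Array[String]): Unit = {
        val api = new TaskInfoApi

        // Java callers consume the returned java.util.Map directly.
        val javaView: ju.Map[String, String] = api.properties()

        // Scala callers pay a single .asScala at the call site; the API surface stays unified.
        val scalaView: Map[String, String] = api.properties().asScala.toMap

        println(javaView.size() == scalaView.size)
      }
    }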
>>>>
>>>> On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>>>>
>>>>> I agree that general guidance is good so we keep the APIs consistent. I don't necessarily agree that 4 is the best solution, though. I agree it's nice to have one API, but it is less friendly for the Scala side. Searching for the equivalent Java API shouldn't be hard, as it should be very close in name, and if we make it a general rule users should understand it. I guess one good question is which API most of our users use, Java or Scala, and what the ratio is? I don't know the answer to that. I've seen more people using Scala than Java. If the majority use Scala, then I think the API should be more friendly to that.
>>>>>
>>>>> Tom
>>>>>
>>>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to discuss Java-specific APIs and which design we will choose. This has been discussed in multiple places so far, for example, at https://github.com/apache/spark/pull/28085#discussion_r407334754
>>>>>
>>>>> *The problem:*
>>>>>
>>>>> In short, I would like us to have clear guidance on how we support Java-specific APIs when they need to return a Java instance. The problem is simple:
>>>>>
>>>>>     def requests: Map[String, ExecutorResourceRequest] = ...
>>>>>     def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>>>>>
>>>>> vs
>>>>>
>>>>>     def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>>>>>
>>>>> *Current codebase:*
>>>>>
>>>>> My understanding so far was that the latter is preferred, more consistent, and prevailing in the existing codebase; for example, see StateOperatorProgress and StreamingQueryProgress in Structured Streaming. However, I realised that we also have other approaches in the current codebase. There appear to be four approaches to dealing with Java specifics in general:
>>>>>
>>>>> 1. Java-specific classes such as JavaRDD and JavaSparkContext.
>>>>> 2. Java-specific methods with the same name that overload their parameters; see functions.scala.
>>>>> 3. Java-specific methods with a different name that need to return a different type, such as TaskContext.resourcesJMap vs TaskContext.resources.
>>>>> 4. One method that returns a Java instance for both the Scala and Java sides; see StateOperatorProgress and StreamingQueryProgress.
>>>>>
>>>>> *Analysis of the current codebase:*
>>>>>
>>>>> I agree with approach 2 because the corresponding cases give you consistent API usage across the other language APIs in general. Approach 1 is from the old world, when we didn't have unified APIs. This might be the worst approach.
>>>>>
>>>>> Approaches 3 and 4 are controversial.
>>>>>
>>>>> For 3, if you have to use the Java APIs, you have to search every time for whether there is a Java-specific variant of the API. But yes, it gives you Java/Scala-friendly instances.
>>>>>
>>>>> For 4, having one API that returns a Java instance lets you use it on both the Scala and Java sides, although it makes you call asScala specifically on the Scala side. But you don't have to search for a variant of the API, and it gives you consistent API usage across languages.
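To make approaches 3 and 4 concrete, here is a small sketch of both shapes side by side; the ResourceRequests classes below are made up (the real example cited in the thread is TaskContext.resources vs TaskContext.resourcesJMap):

    import java.{util => ju}
    import scala.collection.JavaConverters._

    // Approach 3: two differently named methods, one per language.
    class ResourceRequestsV3(reqs: Map[String, String]) {
      def requests: Map[String, String] = reqs                // Scala-friendly
      def requestsJMap: ju.Map[String, String] = reqs.asJava  // Java-friendly twin
    }

    // Approach 4: one method returning a Java instance for both languages;
    // Scala callers write `requests.asScala` at the call site when needed.
    class ResourceRequestsV4(reqs: ju.Map[String, String]) {
      def requests: ju.Map[String, String] = reqs
    }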
>>>>>
>>>>> Also, note that calling Java from Scala is legitimate, but the opposite is not, to the best of my knowledge. In addition, a method that returns a Java instance is needed anyway for PySpark and SparkR support.
>>>>>
>>>>> *Proposal:*
>>>>>
>>>>> I would like to have general guidance on this that the Spark devs agree upon: take approach 4. If that is not possible, take approach 3. Avoid approach 1 at almost all cost.
>>>>>
>>>>> Note that this isn't a hard requirement but *a general guidance*; therefore, the decision might be up to the specific context. For example, when there are strong arguments for a separate Java-specific API, that's fine. Of course, we won't change the existing methods, given Michael's rubric added before. I am talking about new methods in unreleased branches.
>>>>>
>>>>> Any concern or opinion on this?
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix