> One thing we could do here is use Java collections internally and make the Scala API a thin wrapper around Java -- like how Python works.
> Then adding a method to the Scala API would require adding it to the Java API and we would keep the two more in sync.
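For illustration, a minimal sketch of what that thin-wrapper layering could look like, assuming a made-up resource API (the class and method names below are hypothetical, not existing Spark classes): the Java-facing class owns the java.util collections, and the Scala facade only converts at the boundary.

    import java.{util => ju}
    import scala.collection.JavaConverters._

    // Hypothetical internal/Java-facing API that works purely with Java collections.
    class JavaResourceProfile {
      private val reqs = new ju.HashMap[String, String]()

      def addRequest(name: String, amount: String): Unit = reqs.put(name, amount)

      def requests(): ju.Map[String, String] = reqs
    }

    // Hypothetical Scala facade: every method delegates to the Java API, so a new
    // Scala method can only exist once the Java side exposes it too.
    class ResourceProfile(private val underlying: JavaResourceProfile) {
      def requests: Map[String, String] = underlying.requests().asScala.toMap
    }

Under such a layering, a Scala-only method cannot be added without a Java counterpart, which is the synchronization benefit described in the quoted suggestion.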
I think that could be an appropriate approach if we had to deal with this case a lot, but I don't think there are many user-facing APIs that return Java collections; it's rather rare. Also, there are relatively fewer Java users than Scala users. This case is slightly different from Python, where PySpark has many more differences to deal with. Also, in the case of `Seq`, we can simply use `Array` instead on both the Scala and Java sides. I don't find such cases notably awkward. The problematic cases are likely specific to a few Java collections or instances, and I would like to avoid overkill here. Of course, if there is a place where it makes sense to consider other options, let's do so. I don't mean to say this is the only acceptable option.

On Tue, Apr 28, 2020 at 1:18 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> I think the right choice here depends on how the object is used. For developer and internal APIs, I think standardizing on Java collections makes the most sense.
>
> For user-facing APIs, it is awkward to return Java collections to Scala code -- I think that's the motivation for Tom's comment. For user APIs, I think most methods should return Scala collections, and I don't have a strong opinion about whether the conversion (or lack thereof) is done in a separate object (#1) or in parallel methods (#3).
>
> Both #1 and #3 seem like about the same amount of work and have the same likelihood that a developer will leave out a Java method version. One thing we could do here is use Java collections internally and make the Scala API a thin wrapper around Java -- like how Python works. Then adding a method to the Scala API would require adding it to the Java API and we would keep the two more in sync. It would also help avoid Scala collections leaking into internals.
>
> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Let's go with the option that takes less maintenance effort then, rather than leaving it undecided and delaying while this inconsistency remains.
>>
>> I don't think we will get very meaningful data about this soon, given that we haven't heard many complaints about this in general so far.
>>
>> The point of this thread is to make a call rather than defer it to the future.
>>
>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan, <cloud0...@gmail.com> wrote:
>>
>>> IIUC we are moving away from having two classes for Java and Scala, like JavaRDD and RDD. It's much simpler to maintain and use a single class.
>>>
>>> I don't have a strong preference between options 3 and 4. We may need to collect more data points from actual users.
>>>
>>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>>> Scala users are arguably more prevalent than Java users, yes. Using Java instances on the Scala side is legitimate, and they are already used in multiple places. I don't believe Scala users find this un-Scala-friendly, as it is legitimate and already in use. I personally find it more troublesome to make Java users search for which APIs to call. Yes, I understand the pros and cons -- we should also find the balance considering the actual usage.
>>>>
>>>> One more argument from me, though, is that one of the goals of the Spark APIs is a unified API set, to my knowledge, e.g., JavaRDD <> RDD vs DataFrame. If neither way is particularly preferred over the other, I would just choose the one that keeps the API set unified.
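As a concrete illustration of that last point, a single method returning a java.util.Map (in the style of StateOperatorProgress) is usable from both languages, and the cost on the Scala side is one asScala call at the call site. The class and method names below are made up for the sketch:

    import java.{util => ju}
    import scala.collection.JavaConverters._

    // Hypothetical API exposing one Java-friendly method shared by both languages.
    class TaskInfoApi {
      def properties(): ju.Map[String, String] = {
        val m = new ju.HashMap[String, String]()
        m.put("spark.executor.cores", "4")
        m
      }
    }

    object TaskInfoApiUsage {
      def main(args: Array[String]): Unit = {
        val api = new TaskInfoApi

        // Java callers consume the returned java.util.Map directly.
        val javaView: ju.Map[String, String] = api.properties()

        // Scala callers pay a single .asScala at the call site; the API surface stays unified.
        val scalaView: Map[String, String] = api.properties().asScala.toMap

        println(javaView.size() == scalaView.size)
      }
    }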
>>>>
>>>> On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>>>>
>>>>> I agree that general guidance is good so we keep the APIs consistent. I don't necessarily agree that 4 is the best solution, though. I agree it's nice to have one API, but it is less friendly for the Scala side. Searching for the equivalent Java API shouldn't be hard, as it should be very close in name, and if we make it a general rule users should understand it. I guess one good question is which API most of our users use, Java or Scala, and what the ratio is? I don't know the answer to that. I've seen more people using Scala than Java. If the majority use Scala, then I think the API should be more friendly to that.
>>>>>
>>>>> Tom
>>>>>
>>>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to discuss Java-specific APIs and which design we will choose. This has been discussed in multiple places so far, for example, at https://github.com/apache/spark/pull/28085#discussion_r407334754
>>>>>
>>>>> *The problem:*
>>>>>
>>>>> In short, I would like us to have clear guidance on how we support Java-specific APIs when they need to return a Java instance. The problem is simple:
>>>>>
>>>>>     def requests: Map[String, ExecutorResourceRequest] = ...
>>>>>     def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>>>>>
>>>>> vs
>>>>>
>>>>>     def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>>>>>
>>>>> *Current codebase:*
>>>>>
>>>>> My understanding so far was that the latter is preferred, more consistent, and prevailing in the existing codebase; for example, see StateOperatorProgress and StreamingQueryProgress in Structured Streaming. However, I realised that we also have other approaches in the current codebase. There appear to be four approaches to dealing with Java specifics in general:
>>>>>
>>>>> 1. Java-specific classes such as JavaRDD and JavaSparkContext.
>>>>> 2. Java-specific methods with the same name that overload their parameters; see functions.scala.
>>>>> 3. Java-specific methods with a different name that need to return a different type, such as TaskContext.resourcesJMap vs TaskContext.resources.
>>>>> 4. One method that returns a Java instance for both the Scala and Java sides; see StateOperatorProgress and StreamingQueryProgress.
>>>>>
>>>>> *Analysis of the current codebase:*
>>>>>
>>>>> I agree with approach 2 because the corresponding cases give you consistent API usage across the other language APIs in general. Approach 1 is from the old world, when we didn't have unified APIs. This might be the worst approach.
>>>>>
>>>>> Approaches 3 and 4 are controversial.
>>>>>
>>>>> For 3, if you have to use the Java APIs, you have to search every time for whether there is a Java-specific variant of the API. But yes, it gives you Java/Scala-friendly instances.
>>>>>
>>>>> For 4, having one API that returns a Java instance lets you use it on both the Scala and Java sides, although it makes you call asScala specifically on the Scala side. But you don't have to search for a variant of the API, and it gives you consistent API usage across languages.
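To make approaches 3 and 4 concrete, here is a small sketch of both shapes side by side; the ResourceRequests classes below are made up (the real example cited in the thread is TaskContext.resources vs TaskContext.resourcesJMap):

    import java.{util => ju}
    import scala.collection.JavaConverters._

    // Approach 3: two differently named methods, one per language.
    class ResourceRequestsV3(reqs: Map[String, String]) {
      def requests: Map[String, String] = reqs                // Scala-friendly
      def requestsJMap: ju.Map[String, String] = reqs.asJava  // Java-friendly twin
    }

    // Approach 4: one method returning a Java instance for both languages;
    // Scala callers write `requests.asScala` at the call site when needed.
    class ResourceRequestsV4(reqs: ju.Map[String, String]) {
      def requests: ju.Map[String, String] = reqs
    }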
>>>>>
>>>>> Also, note that calling Java from Scala is legitimate, but the opposite is not, to the best of my knowledge. In addition, a method that returns a Java instance is needed anyway for PySpark and SparkR support.
>>>>>
>>>>> *Proposal:*
>>>>>
>>>>> I would like to have general guidance on this that the Spark devs agree upon: take approach 4. If that is not possible, take approach 3. Avoid approach 1 at almost all cost.
>>>>>
>>>>> Note that this isn't a hard requirement but *a general guidance*; therefore, the decision might be up to the specific context. For example, when there are strong arguments for a separate Java-specific API, that's fine. Of course, we won't change the existing methods, given Michael's rubric added before. I am talking about new methods in unreleased branches.
>>>>>
>>>>> Any concern or opinion on this?
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix