How about making a small change to option 4: keep the Scala API returning a Scala type instance, while providing an `asJava` method to return a Java type instance?
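For example, a minimal sketch of the pattern. The class and member names here are hypothetical stand-ins, and the import assumes the Scala 2.13 CollectionConverters described just below (on 2.12, scala.collection.JavaConverters provides the same `.asJava`):

    import scala.jdk.CollectionConverters._

    // Hypothetical stand-in for the real Spark class, for illustration only.
    final case class ExecutorResourceRequest(amount: Long)

    class ResourceProfile {
      // The Scala API keeps returning a Scala type...
      def requests: Map[String, ExecutorResourceRequest] =
        Map("cores" -> ExecutorResourceRequest(4))
    }

    object Demo extends App {
      val profile = new ResourceProfile
      // Scala callers get the Scala Map directly:
      val scalaMap: Map[String, ExecutorResourceRequest] = profile.requests
      // ...and convert on demand with the .asJava extension method:
      val javaMap: java.util.Map[String, ExecutorResourceRequest] = profile.requests.asJava
      // Java callers can use the static converters instead:
      //   scala.jdk.javaapi.CollectionConverters.asJava(profile.requests())
      println(javaMap)
    }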
Scala 2.13 provides CollectionConverters [1][2][3], so this would be supported naturally once the Spark dependency is upgraded. For the current Scala 2.12 version, we can wrap `ImplicitConversionsToJava` [4] as Scala 2.13 does and add the implicit conversions ourselves. Just my 2 cents.

--
Cheers,
-z

[1] https://docs.scala-lang.org/overviews/collections-2.13/conversions-between-java-and-scala-collections.html
[2] https://www.scala-lang.org/api/2.13.0/scala/jdk/javaapi/CollectionConverters$.html
[3] https://www.scala-lang.org/api/2.13.0/scala/jdk/CollectionConverters$.html
[4] https://www.scala-lang.org/api/2.12.11/scala/collection/convert/ImplicitConversionsToJava$.html

On Tue, 28 Apr 2020 08:52:57 +0900 Hyukjin Kwon <gurwls...@gmail.com> wrote:

> I would like to make sure I am open to other options that can be considered situationally and based on the context. It's okay, and I don't aim to restrict this here. For example, DSv2: I understand it's written in Java because Java interfaces arguably bring better performance. That's why the vectorized readers are written in Java too.
>
> Maybe the "general" wasn't explicit in my previous email. Adding APIs that return a Java instance is still rather rare in general, given my few years of monitoring. The problem I would rather deal with is when we need to add one or a couple of user-facing Java-specific APIs that return Java instances, which is relatively more frequent than when we need a whole set of Java-specific APIs.
>
> In this case, I think the guidance should be to use approach 4. There are pros and cons between 3 and 4, of course, but it looks to me like approach 4 is closer to what Spark has targeted so far.
>
> On Tue, Apr 28, 2020 at 8:34 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> > > One thing we could do here is use Java collections internally and make the Scala API a thin wrapper around Java -- like how Python works. Then adding a method to the Scala API would require adding it to the Java API and we would keep the two more in sync.
> >
> > I think it could be an appropriate idea if we had to deal with this case a lot, but I don't think there are many user-facing APIs that return Java collections; it's rather rare. Also, Java users are relatively fewer than Scala users. This case is slightly different from Python in that there are many more differences to deal with in the PySpark case.
> >
> > Also, in the case of `Seq`, we can simply use `Array` instead on both the Scala and Java sides. I don't find such cases notably awkward. The problematic cases might be specific to a few Java collections or instances, and I would like to avoid overkill here.
> >
> > Of course, if there is a place where other options should be considered, let's do so. I don't mean to say this is the only allowed option.
> >
> > On Tue, Apr 28, 2020 at 1:18 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> >
> >> I think the right choice here depends on how the object is used. For developer and internal APIs, I think standardizing on Java collections makes the most sense.
> >>
> >> For user-facing APIs, it is awkward to return Java collections to Scala code -- I think that's the motivation for Tom's comment. For user APIs, I think most methods should return Scala collections, and I don't have a strong opinion about whether the conversion (or lack thereof) is done in a separate object (#1) or in parallel methods (#3).
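> >>
> >> For reference, a rough sketch of those two shapes, with hypothetical names, assuming Scala 2.13's converters:
> >>
> >>     import scala.jdk.CollectionConverters._
> >>
> >>     // #1: the conversion lives in a separate Java-facing wrapper object:
> >>     class Thing {
> >>       def requests: Map[String, Long] = Map("cores" -> 4L)
> >>     }
> >>     class JavaThing(thing: Thing) {
> >>       def requests: java.util.Map[String, Long] = thing.requests.asJava
> >>     }
> >>
> >>     // #3: the conversion lives in a parallel method on the same class:
> >>     class Thing2 {
> >>       def requests: Map[String, Long] = Map("cores" -> 4L)
> >>       def requestsJMap: java.util.Map[String, Long] = requests.asJava
> >>     }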
> >>
> >> Both #1 and #3 seem like about the same amount of work, and both have the same likelihood that a developer will leave out a Java method version. One thing we could do here is use Java collections internally and make the Scala API a thin wrapper around Java -- like how Python works. Then adding a method to the Scala API would require adding it to the Java API, and we would keep the two more in sync. It would also help avoid Scala collections leaking into the internals.
> >>
> >> On Mon, Apr 27, 2020 at 8:49 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >>
> >>> Let's stick to the option with less maintenance effort, then, rather than leaving this undecided and delaying while the inconsistency remains.
> >>>
> >>> I don't think we will get very meaningful data about this soon, given that we haven't heard many complaints about it in general so far.
> >>>
> >>> The point of this thread is to make a call rather than defer it to the future.
> >>>
> >>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan <cloud0...@gmail.com> wrote:
> >>>
> >>>> IIUC we are moving away from having two classes for Java and Scala, like JavaRDD and RDD. It's much simpler to maintain and use a single class.
> >>>>
> >>>> I don't have a strong preference between options 3 and 4. We may need to collect more data points from actual users.
> >>>>
> >>>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >>>>
> >>>>> Scala users are arguably more prevalent than Java users, yes. Using Java instances on the Scala side is legitimate, and they are already being used in multiple places. I don't believe Scala users find this un-Scala-friendly, as it's legitimate and already in use. I personally find it more troublesome to make Java users search for which APIs to call. Yes, I understand the pros and cons; we should also find the balance considering actual usage.
> >>>>>
> >>>>> One more argument from me, though: I think one of the goals of the Spark APIs is a unified API set, to my knowledge, e.g., JavaRDD <> RDD vs. DataFrame. If neither way is particularly preferred over the other, I would just choose the one that keeps the API set unified.
> >>>>>
> >>>>> On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:
> >>>>>
> >>>>>> I agree that general guidance is good so we keep the APIs consistent. I don't necessarily agree that 4 is the best solution, though. I agree it's nice to have one API, but it is less friendly for the Scala side. Searching for the equivalent Java API shouldn't be hard, since the name should be very close, and if we make it a general rule, users should understand it. I guess one good question is: which API do most of our users use, Java or Scala, and what is the ratio? I don't know the answer to that. I've seen more people using Scala than Java. If the majority use Scala, then I think the API should be more friendly to that.
> >>>>>>
> >>>>>> Tom
> >>>>>>
> >>>>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I would like to discuss Java-specific APIs and which design we will choose.
> >>>>>> This has been discussed in multiple places so far, for example, at https://github.com/apache/spark/pull/28085#discussion_r407334754
> >>>>>>
> >>>>>> *The problem:*
> >>>>>>
> >>>>>> In short, I would like us to have clear guidance on how we support Java-specific APIs when they need to return a Java instance. The problem is simple:
> >>>>>>
> >>>>>>     def requests: Map[String, ExecutorResourceRequest] = ...
> >>>>>>     def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
> >>>>>>
> >>>>>> vs
> >>>>>>
> >>>>>>     def requests: java.util.Map[String, ExecutorResourceRequest] = ...
> >>>>>>
> >>>>>> *Current codebase:*
> >>>>>>
> >>>>>> My understanding so far was that the latter is preferred, more consistent, and prevailing in the existing codebase; for example, see StateOperatorProgress and StreamingQueryProgress in Structured Streaming. However, I realised that we also have other approaches in the current codebase. There appear to be four approaches to dealing with Java specifics in general:
> >>>>>>
> >>>>>>   1. Java-specific classes such as JavaRDD and JavaSparkContext.
> >>>>>>   2. Java-specific methods with the same name but overloaded parameters; see functions.scala.
> >>>>>>   3. Java-specific methods with a different name that return a different type, such as TaskContext.resourcesJMap vs. TaskContext.resources.
> >>>>>>   4. One method that returns a Java instance for both the Scala and Java sides; see StateOperatorProgress and StreamingQueryProgress.
> >>>>>>
> >>>>>> *Analysis of the current codebase:*
> >>>>>>
> >>>>>> I agree with approach 2 because the corresponding cases give you consistent API usage across the other language APIs in general. Approach 1 is from the old world, when we didn't have unified APIs. This might be the worst approach.
> >>>>>>
> >>>>>> Approaches 3 and 4 are controversial.
> >>>>>>
> >>>>>> For 3, if you have to use the Java APIs, you have to search every time for whether there is a Java-specific variant of the API. But yes, it gives you Java/Scala-friendly instances.
> >>>>>>
> >>>>>> For 4, having one API that returns a Java instance lets you use it on both the Scala and Java sides, although it makes you call asScala on the Scala side specifically. But you don't have to search for a variant of the API, and it gives you consistent API usage across languages.
> >>>>>>
> >>>>>> Also, note that calling Java from Scala is legitimate, but the opposite is not, to the best of my knowledge. In addition, you need a method that returns a Java instance in order to support PySpark or SparkR.
> >>>>>>
> >>>>>> *Proposal:*
> >>>>>>
> >>>>>> I would like to have general guidance on this that the Spark devs agree upon: use approach 4. If that's not possible, use approach 3. Avoid approach 1 at almost any cost.
> >>>>>>
> >>>>>> Note that this isn't a hard requirement but *a general guidance*; therefore, the decision might be up to the specific context. For example, when there are strong arguments to have a separate Java-specific API, that's fine. Of course, we won't change the existing methods, given Michael's rubric added before. I am talking about new methods in unreleased branches.
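> >>>>>>
> >>>>>> For concreteness, a minimal sketch of what approach 4 costs each caller; the class and member names are hypothetical, and the import assumes Scala 2.13's converters:
> >>>>>>
> >>>>>>     import scala.jdk.CollectionConverters._
> >>>>>>
> >>>>>>     // Approach 4: a single method returning a Java instance (hypothetical names).
> >>>>>>     class Progress {
> >>>>>>       def requests: java.util.Map[String, Long] =
> >>>>>>         java.util.Collections.singletonMap("cores", 4L)
> >>>>>>     }
> >>>>>>
> >>>>>>     object Demo extends App {
> >>>>>>       val p = new Progress
> >>>>>>       // Java callers use the returned java.util.Map directly; Scala callers
> >>>>>>       // pay one extra .asScala call when they prefer a Scala Map:
> >>>>>>       val scalaView = p.requests.asScala
> >>>>>>       println(scalaView("cores"))
> >>>>>>     }
> >>>>>>
> >>>>>> That is the asScala cost mentioned above; approach 3 trades it for a second, differently named method.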
> >>>>>>
> >>>>>> Any concern or opinion on this?
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix