I agree that general guidance is good so that we keep the APIs consistent. I don't 
necessarily agree that 4 is the best solution, though. I agree it's nice to have 
one API, but it is less friendly on the Scala side. Searching for the 
equivalent Java API shouldn't be hard, since the name should be very close, 
and if we make it a general rule users should understand it. One good 
question is which API most of our users use, Java or Scala, and what 
the ratio is; I don't know the answer to that. I've seen more people using Scala 
than Java. If the majority use Scala, then I think the API should be more friendly 
to that.
Tom
    On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon 
<gurwls...@gmail.com> wrote:  
 
 
Hi all,

I would like to discuss Java-specific APIs and which design we should choose.
This has been discussed in multiple places so far, for example, at
https://github.com/apache/spark/pull/28085#discussion_r407334754


The problem:

In short, I would like us to have clear guidance on how we support Java-specific 
APIs when they need to return a Java instance. The problem is simple:
def requests: Map[String, ExecutorResourceRequest] = ...
def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...

vs
def requests: java.util.Map[String, ExecutorResourceRequest] = ...
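
For illustration, here is a rough sketch of how the two shapes could look on the 
declaration side. ExecutorResourceRequest is replaced with a hypothetical 
ResourceRequest case class so the snippet is self-contained; the class names are 
placeholders rather than actual Spark APIs.

import scala.collection.JavaConverters._

case class ResourceRequest(amount: Long)

// Shape 1: two methods, one per language.
class ProfileWithTwoMethods {
  private val reqs = Map("cores" -> ResourceRequest(4))

  // Scala-friendly view.
  def requests: Map[String, ResourceRequest] = reqs
  // Java-friendly view under a different name.
  def requestsJMap: java.util.Map[String, ResourceRequest] = reqs.asJava
}

// Shape 2: a single method that always returns a Java map.
class ProfileWithOneMethod {
  private val reqs = Map("cores" -> ResourceRequest(4))

  def requests: java.util.Map[String, ResourceRequest] = reqs.asJava
}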


Current codebase:

My understanding so far was that the latter is preferred, more consistent, and more 
prevalent in the existing codebase; for example, see StateOperatorProgress and 
StreamingQueryProgress in Structured Streaming.
However, I realised that we also have other approaches in the current codebase. 
There seem to be four approaches to dealing with Java specifics in general:
   
   1. Java-specific classes such as JavaRDD and JavaSparkContext.
   2. Java-specific methods with the same name that overload their parameter types; see functions.scala (a small sketch follows this list).
   3. Java-specific methods with a different name that return a different type, such as TaskContext.resourcesJMap vs TaskContext.resources.
   4. One method that returns a Java instance for both the Scala and Java sides; see StateOperatorProgress and StreamingQueryProgress.
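
To make approach 2 concrete, here is a minimal sketch in the spirit of 
functions.scala. The object and method names below are made up purely for 
illustration; they are not existing Spark APIs.

import scala.collection.JavaConverters._

object ExampleFunctions {
  // Scala-facing variant takes a Scala Seq.
  def joinNames(names: Seq[String]): String = names.mkString(", ")

  // Java-facing overload with the same name takes a java.util.List
  // and simply delegates to the Scala variant.
  def joinNames(names: java.util.List[String]): String =
    joinNames(names.asScala.toSeq)
}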



Analysis of the current codebase:

I agree with approach 2 because the corresponding cases give you consistent API 
usage across the other language APIs in general. Approach 1 is from the old world, 
when we didn't have unified APIs; this might be the worst approach.

Approaches 3 and 4 are more controversial.

For approach 3, if you have to use the Java APIs, you have to check every time 
whether there is a Java-specific variant of the API. But yes, it gives you 
Java- and Scala-friendly instances.

For approach 4, having one API that returns a Java instance lets you use it on 
both the Scala and Java sides, although it forces Scala callers to call asScala 
explicitly. But you don't have to search for a Java-specific variant of the API, 
and it gives you consistent API usage across languages.
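
From the caller's side, the trade-off of approach 4 looks roughly like this, 
reusing the hypothetical ProfileWithOneMethod and ResourceRequest from the earlier 
sketch:

import scala.collection.JavaConverters._

object CallerSide {
  def main(args: Array[String]): Unit = {
    val profile = new ProfileWithOneMethod

    // Java callers can use the returned java.util.Map directly.
    val javaView: java.util.Map[String, ResourceRequest] = profile.requests

    // Scala callers have to convert explicitly, but they never need to look
    // for a separate Java-specific variant of the method.
    val scalaView: Map[String, ResourceRequest] = profile.requests.asScala.toMap

    println(javaView.size == scalaView.size)
  }
}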

Also, note that calling Java from Scala is legitimate, but the opposite is not, 
to the best of my knowledge. In addition, a method that returns a Java instance 
is needed anyway for PySpark and SparkR to build on.


Proposal:

I would like us to agree on general guidance for this as Spark devs: 
use approach 4; if that is not possible, use approach 3. Avoid approach 1 at almost any cost.

Note that this isn't a hard requirement but general guidance; the decision may 
depend on the specific context. For example, when there are strong arguments for 
a separate Java-specific API, that's fine.
Of course, we won't change the existing methods, given Michael's rubric added 
before. I am talking about new methods in unreleased branches.

Any concerns or opinions on this?
  
