Scala users are arguably more prevalent than Java users, yes.
Using Java instances on the Scala side is legitimate, and they are already
used in multiple places. I don't believe Scala
users find this unfriendly, since it is legitimate and already in
use. I personally find it more troublesome to make Java
users search for which APIs to call. Yes, I understand the pros and cons -
we should also find the balance considering actual usage.

One more argument from me, though: I think one of the goals of the Spark
APIs is a unified API set, to my knowledge,
 e.g., JavaRDD <> RDD vs. DataFrame.
If neither way is particularly preferred over the other, I would just
choose the one that keeps the API set unified.
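
To illustrate what I mean by a unified API set, here is a rough sketch only
(sc, jsc and spark stand for the usual SparkContext, JavaSparkContext and
SparkSession entry points):

// Old world: a separate Java-specific wrapper type per language.
val rdd: org.apache.spark.rdd.RDD[String] = sc.textFile("path")
val javaRdd: org.apache.spark.api.java.JavaRDD[String] = jsc.textFile("path")

// Unified world: one DataFrame API shared by both the Scala and Java sides.
val df: org.apache.spark.sql.DataFrame = spark.read.text("path")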



On Mon, Apr 27, 2020 at 10:37 PM, Tom Graves <tgraves...@yahoo.com> wrote:

> I agree that general guidance is good so we keep the APIs consistent. I
> don't necessarily agree that 4 is the best solution though.  I agree it's
> nice to have one API, but it is less friendly for the Scala side.
> Searching for the equivalent Java API shouldn't be hard, as the name should
> be very close, and if we make it a general rule users should
> understand it.   I guess one good question is which API most of our users
> use between Java and Scala, and what the ratio is? I don't know the answer
> to that. I've seen more using Scala than Java.  If the majority use Scala,
> then I think the API should be more friendly to that.
>
> Tom
>
> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
> gurwls...@gmail.com> wrote:
>
>
> Hi all,
>
> I would like to discuss Java-specific APIs and which design we will choose.
> This has been discussed in multiple places so far, for example, at
> https://github.com/apache/spark/pull/28085#discussion_r407334754
>
>
> *The problem:*
>
> In short, I would like us to have clear guidance on how we support
> Java-specific APIs when they need to return a Java instance. The problem is
> simple:
>
> def requests: Map[String, ExecutorResourceRequest] = ...
> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...
>
> vs
>
> def requests: java.util.Map[String, ExecutorResourceRequest] = ...
>
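> From the Scala caller's side, the difference looks roughly like this (a
> sketch only; rp stands for whatever object exposes the method):
>
> // With the first option, Scala callers use the Scala-returning variant:
> val reqs: Map[String, ExecutorResourceRequest] = rp.requests
>
> // With the second option, there is one method returning a Java Map, and
> // Scala callers convert it when they want a Scala Map:
> import scala.collection.JavaConverters._
> val reqs2: Map[String, ExecutorResourceRequest] = rp.requests.asScala.toMap
>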
>
> *Current codebase:*
>
> My understanding so far was that the latter is preferred, being more
> consistent and prevalent in the
> existing codebase; for example, see StateOperatorProgress and
> StreamingQueryProgress in Structured Streaming.
> However, I realised that we also have other approaches in the current
> codebase. There seem to be
> four approaches to dealing with Java specifics in general:
>
>    1. Java-specific classes such as JavaRDD and JavaSparkContext.
>    2. Java-specific methods with the same name, overloaded on their
>    parameter types; see functions.scala (and the sketch after this list).
>    3. Java-specific methods with a different name that need to return a
>    different type, such as TaskContext.resourcesJMap vs.
>    TaskContext.resources.
>    4. One method that returns a Java instance for both the Scala and Java
>    sides; see StateOperatorProgress and StreamingQueryProgress.
>
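> As a rough sketch, approach 2. usually looks like the overloads below
> (simplified signatures in the style of Dataset.agg; aggImpl is just a
> placeholder for the shared implementation):
>
> import scala.collection.JavaConverters._
>
> // Scala-friendly overload taking a Scala Map.
> def agg(exprs: Map[String, String]): DataFrame = aggImpl(exprs)
>
> // Java-friendly overload with the same name taking a java.util.Map.
> def agg(exprs: java.util.Map[String, String]): DataFrame =
>   aggImpl(exprs.asScala.toMap)
>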
>
> *Analysis on the current codebase:*
>
> I agree with approach 2. because the corresponding cases give you
> consistent API usage across
> the other language APIs in general. Approach 1. is from the old world, when
> we didn't have unified APIs.
> This might be the worst approach.
>
> 3. and 4. are controversial.
>
> For 3., if you have to use the Java APIs, then every time you have to check
> whether there is a Java-specific variant of that API. But yes, it gives you
> Java/Scala-friendly instances.
>
> For 4., having one API that returns a Java instance lets you use it on both
> the Scala and Java
> sides, although it makes you call asScala specifically on the Scala side.
> But you don’t
> have to search for a variant of the API, and it gives you
> consistent API usage across languages.
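>
> For example, on the Scala side approach 4. looks like this (a sketch; query
> is a StreamingQuery handle, and durationMs is one of the Java Maps returned
> by StreamingQueryProgress):
>
> import scala.collection.JavaConverters._
> val durations: java.util.Map[String, java.lang.Long] =
>   query.lastProgress.durationMs
> val scalaView = durations.asScala  // convert only when a Scala Map is wanted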
>
> Also, note that calling Java from Scala is legitimate, but the opposite
> is not, to the best of my knowledge.
> In addition, you need a method that returns a Java instance anyway in order
> to support PySpark or SparkR.
>
>
> *Proposal:*
>
> I would like to have general guidance on this that the Spark devs agree
> upon: take approach 4. If that is not possible, take 3. Avoid 1. at almost
> all cost.
>
> Note that this isn't a hard requirement but *general guidance*;
> therefore, the decision might depend on
> the specific context. For example, when there are strong arguments for
> a separate Java-specific API, that’s fine.
> Of course, we won’t change the existing methods, given Michael’s rubric
> added before. I am talking about new
> methods in unreleased branches.
>
> Any concern or opinion on this?
>
