On Java vs Scala: Sean's right that behind the scenes you'll be calling JVM-based APIs anyway (e.g. sun.misc.Unsafe for Tungsten) and that the vast majority of Apache Spark's important logic is written in Scala.
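To make that concrete, here's a minimal sketch (Scala, assuming Spark 2.x in local mode; the object name WrapperDemo is made up for illustration) showing that a JavaRDD is just a thin wrapper holding a reference to the very same Scala RDD object, so both APIs end up in the same Scala-implemented core:

// Minimal sketch: the Java API wraps the Scala API rather than reimplementing it.
import org.apache.spark.sql.SparkSession
import org.apache.spark.api.java.JavaRDD

object WrapperDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("wrapper-demo").getOrCreate()
    val sc = spark.sparkContext

    // Scala API: an RDD[Int] backed by the Scala core.
    val scalaRdd = sc.parallelize(1 to 1000)

    // Java API: JavaRDD simply holds a reference to that same Scala RDD.
    val javaRdd: JavaRDD[Int] = JavaRDD.fromRDD(scalaRdd)
    println(javaRdd.rdd eq scalaRdd)  // prints true - identical underlying object

    spark.stop()
  }
}

JavaSparkContext wraps SparkContext in the same way, which is why the overhead Sean mentions is mostly just the translation at the wrapper boundary.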
It would be an interesting experiment to write the same functioning program using the Java APIs vs the Scala APIs, just to see whether there is a noticeable difference. I'm thinking in terms of how the Scala implementation libraries perform at runtime: with profiling (we use Health Center, tprof, or just microbenchmarking with prints and timers) we've seen lots of code in Scala itself to do with (un)boxing and instanceOf checks that could do with some TLC for performance. This is now quite outdated, but it still shows that writing what's concise (Scala) isn't always best for performance: https://jazzy.id.au/2012/10/16/benchmarking_scala_against_java.html

So if we stick to Java we may not hit those overheads as often (there's a talk by my colleague on boosting performance from a Java implementer's perspective at https://www.youtube.com/watch?v=rcVTM-71bZk), but I don't expect the differences to be enormous. Full disclosure: I work for IBM, and one of our goals is to make Apache Spark and our Java implementation perform well together.

There's also the obvious trade-off of developer productivity and code maintainability (more Java devs than Scala devs), so my suggestion is: whichever of Java or Scala you're much better at writing, use that for the really important, performance-critical logic. Be aware that you're going to be hitting the Apache Spark codebase written in Scala anyway, so there's only so much to be gained here. I also think that Just-In-Time compiler implementations are generally better at optimising what's written as Java code rather than Scala code, since knowing the types well ahead of time, and knowing where we can take codepath shortcuts in the bytecode execution, should deliver a slight performance improvement. I'm keen to come up with some solid recommendations, based on evidence, that we can all benefit from. A rough sketch of the timing experiment I have in mind follows the quoted thread below.

From: Aseem Bansal <asmbans...@gmail.com>
To: ayan guha <guha.a...@gmail.com>
Cc: Sean Owen <so...@cloudera.com>, user <user@spark.apache.org>
Date: 01/09/2016 13:11
Subject: Re: Spark 2.0.0 - Java vs Scala performance difference

there is already a mail thread for scala vs python. check the archives

On Thu, Sep 1, 2016 at 5:18 PM, ayan guha <guha.a...@gmail.com> wrote:

How about Scala vs Python?

On Thu, Sep 1, 2016 at 7:27 PM, Sean Owen <so...@cloudera.com> wrote:

I can't think of a situation where it would be materially different. Both are using the JVM-based APIs directly. Here and there there's a tiny bit of overhead in using the Java APIs because something is translated from a Java-style object to a Scala-style object, but this is generally trivial.

On Thu, Sep 1, 2016 at 10:06 AM, Aseem Bansal <asmbans...@gmail.com> wrote:
> Hi
>
> Would there be any significant performance difference when using Java vs.
> Scala API?
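Here's the kind of prints-and-timers comparison I mean: a minimal sketch only, assuming local mode and a trivial count() job, where the object name JavaVsScalaApiTiming, the timed helper, and the workload size are all made up for illustration. It is not a proper benchmark (no warm-up iterations, no JMH):

// Rough sketch: time the same cached job through the Scala RDD API and the Java wrapper API.
import org.apache.spark.sql.SparkSession
import org.apache.spark.api.java.JavaRDD

object JavaVsScalaApiTiming {
  // Crude timer helper: runs the body once and prints elapsed wall-clock time.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label%-20s $elapsedMs%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("api-timing").getOrCreate()
    val sc = spark.sparkContext

    val scalaRdd = sc.parallelize(1 to 1000000).cache()
    scalaRdd.count()                                       // materialise the cache before timing
    val javaRdd: JavaRDD[Int] = JavaRDD.fromRDD(scalaRdd)  // wrapper over the same RDD

    timed("Scala API count") { scalaRdd.count() }
    timed("Java API count")  { javaRdd.count() }

    spark.stop()
  }
}

Since both calls end up in the same Scala count() implementation, any gap here is mostly measurement noise; a real comparison would need user closures on both sides (map/filter written against the Java Function interfaces vs Scala lambdas) and proper warm-up before I'd draw any conclusions from it.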