I'm still a Spark newbie, but I have a heavy background in languages and compilers... so take this with a barrel of salt...
Scala, to me, is the heart and soul of Spark. I couldn't work without it. Procedural languages like Python, Java, and all the rest are lovely when you have a couple of processors, but they don't scale. (Pun intended.) It's the same reason a slew of 'shader' languages had to be invented for GPU programming. In fact, that's how I see Scala: as the "CUDA" or "GLSL" of cluster computing.

Now, Scala isn't perfect. It could learn a thing or two from occam about interprocess communication. (And from node.js about package management.) But functional programming becomes essential for highly parallel code, because the primary difference is that functional code declares _what_ you want to do, while procedural code declares _how_ you want to do it. Since you rarely know the shape of the cluster/graph ahead of time, functional programming becomes the superior paradigm, especially for the "outermost" parts of the program that interface with the scheduler.

Python might be fine for the granular fragments, but you would have to export all those independent functions somehow, and define the scheduling and connective structure (the DAG) elsewhere, in yet another language or library. To fit neatly into GraphX, Python would probably have to be warped in the same way that GLSL is a restricted, C-like dialect. You'd probably lose everything you like about the language in order to make it seamless.

I'm pretty agnostic about the whole Spark stack and its components (e.g. every time I run sbt/sbt assembly, Stuart Feldman dies a little inside and I get time to write another long email), but Scala is the one thing that gives it legs. I wish the rest of Spark was more like it. (i.e. 'no ceremony')

Scala might seem 'weird', but that's because it directly exposes parallelism, and the ways to cope with it. I've done enough distributed programming that the advantages are obvious, for that domain. You're not being asked to re-wire your thinking for Scala's benefit, but to solve the underlying problem.
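To make the what-versus-how distinction concrete, here's a toy sketch (in Python, since it's the other language under discussion; the same contrast holds in Scala). The procedural version prescribes an order of operations over shared state; the functional version just declares the result, leaving the ordering and partitioning to whatever runtime executes it, which is exactly the freedom a cluster scheduler needs:

```python
nums = [3, 1, 4, 1, 5, 9, 2, 6]

# Procedural: prescribes *how* -- an explicit sequence of steps,
# mutating an accumulator one element at a time.
total = 0
for n in nums:
    if n % 2 == 0:
        total += n * n

# Functional: declares *what* -- "the sum of squares of the evens".
# No ordering or shared state is implied, so a scheduler is free
# to split the work across partitions and combine partial sums.
declared = sum(n * n for n in nums if n % 2 == 0)

assert total == declared  # both compute 4 + 16 + 36 = 56
print(declared)
```

Spark's RDD API (map/filter/reduce) is the second style writ large: you hand the scheduler a description of the computation, not a recipe for executing it.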
(But you are still being asked to turn your thinking sideways, I will admit.)

People love Python because it fits its intended domain perfectly. That doesn't mean you'll love it just as much for embedded hardware, or GPU shader development, or telecoms, or Spark. Then again, give me another week with the language, and see what I'm screaming about then ;-)

On Thu, Jun 5, 2014 at 10:21 AM, John Omernik <j...@omernik.com> wrote:

> Thank you for the response. If it helps at all: I demoed the Spark
> platform for our data science team today. The idea of moving code from
> batch testing, to Machine Learning systems, GraphX, and then to near-real-
> time models with streaming was cheered by the team as an efficiency they
> would love. That said, most folks on our team are Python junkies, and
> they love that Spark seems to be committing to Python, and would REALLY
> love to see Python in Streaming; it would feel complete for them from a
> platform standpoint. It is still awesome using Scala, and many will learn
> that, but that full Python integration/support, if possible, would be a
> home run.
>
> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> We are definitely investigating a Python API for Streaming, but no
>> announced deadline at this point.
>>
>> Matei
>>
>> On Jun 4, 2014, at 5:02 PM, John Omernik <j...@omernik.com> wrote:
>>
>> So Python is used in many of the Spark ecosystem products, but not
>> Streaming at this point. Is there a roadmap to include Python APIs in
>> Spark Streaming? Any time frame on this?
>>
>> Thanks!
>>
>> John
>>
>> On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>>> Quite a few people ask this question, and the answer is pretty simple.
>>> When we started Spark, we had two goals: we wanted to work with the Hadoop
>>> ecosystem, which is JVM-based, and we wanted a concise programming
>>> interface similar to Microsoft's DryadLINQ (the first language-integrated
>>> big data framework I know of, which begat things like FlumeJava and Crunch).
>>> On the JVM, the only language that would offer that kind of API was Scala,
>>> due to its ability to capture functions and ship them across the network.
>>> Scala's static typing also made it much easier to control performance
>>> compared to, say, Jython or Groovy.
>>>
>>> In terms of usage, however, we see substantial usage of our other
>>> languages (Java and Python), and we're continuing to invest in both. In a
>>> user survey we did last fall, about 25% of users used Java and 30% used
>>> Python, and I imagine these numbers are growing. With lambda expressions
>>> now added to Java 8
>>> (http://databricks.com/blog/2014/04/14/Spark-with-Java-8.html), I think
>>> we'll see a lot more Java. And at Databricks I've seen a lot of interest in
>>> Python, which is very exciting to us in terms of ease of use.
>>>
>>> Matei
>>>
>>> On May 29, 2014, at 1:57 PM, Benjamin Black <b...@b3k.us> wrote:
>>>
>>> HN is a cesspool safely ignored.
>>>
>>> On Thu, May 29, 2014 at 1:55 PM, Nick Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> I recently discovered Hacker News and started reading through older
>>>> posts about Scala
>>>> <https://hn.algolia.com/?q=scala#!/story/forever/0/scala>. It looks
>>>> like the language is fairly controversial on there, and it got me thinking.
>>>>
>>>> Scala appears to be the preferred language to work with in Spark, and
>>>> Spark itself is written in Scala, right?
>>>>
>>>> I know that oftentimes a successful project evolves gradually out of
>>>> something small, and that the choice of programming language may not always
>>>> have been made consciously at the outset.
>>>> But pretending that it was, why is Scala the preferred language of
>>>> Spark?
>>>>
>>>> Nick
>>>>
>>>> ------------------------------
>>>> View this message in context: Why Scala?
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536.html>
>>>> Sent from the Apache Spark User List mailing list archive
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.

--
Jeremy Lee  BCompSci(Hons)
The Unorthodox Engineers
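P.S. Matei's point above about Scala's "ability to capture functions and ship them across the network" is the crux of the whole thread, and it's worth seeing how minimal the idea is. Here's a toy Python sketch of the same mechanism (a driver serializes a function, a worker reconstructs and calls it). Note this is an illustration, not how Spark itself works: plain pickle only handles module-level functions by reference, which is why PySpark ships closures with cloudpickle instead.

```python
import pickle

# A module-level function: a first-class value we can serialize,
# "ship" as bytes, and invoke on the receiving side.
def double(x):
    return 2 * x

payload = pickle.dumps(double)     # what a driver would send over the wire
remote_fn = pickle.loads(payload)  # what a worker would reconstruct

print(remote_fn(21))  # 42
```

Pickle records the function by module and name, so the receiving process must have the same code importable; lambdas and closures over local state fail outright. Scala sidesteps this because its compiled closures are serializable objects carrying their captured environment, which is the property Matei is pointing at.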