Just a quick (or maybe not :) ) question... Given the tight coupling to the Apache Spark project, was there any consideration or discussion with the Spark community about including the Spark-Kernel functionality in Spark outright, or about it becoming a subproject?
I'm just curious. I don't think an answer one way or another would necessarily block incubation.

-Taylor

> On Nov 12, 2015, at 7:17 PM, da...@fallside.com wrote:
>
> Hello, we would like to start a discussion on accepting the Spark-Kernel, a mechanism for applications to interactively and remotely access Apache Spark, into the Apache Incubator.
>
> The proposal is available online at https://wiki.apache.org/incubator/SparkKernelProposal, and it is appended to this email.
>
> We are looking for additional mentors to help with this project, and we would much appreciate your guidance and advice.
>
> Thank you in advance,
> David Fallside
>
>
> = Spark-Kernel Proposal =
>
> == Abstract ==
> Spark-Kernel provides applications with a mechanism to interactively and remotely access Apache Spark.
>
> == Proposal ==
> The Spark-Kernel enables interactive applications to access Apache Spark clusters. More specifically:
> * Applications can send code snippets and libraries for execution by Spark
> * Applications can be deployed separately from Spark clusters and communicate with the Spark-Kernel using the provided Spark-Kernel client
> * Execution results and streaming data can be sent back to the calling applications
> * Applications no longer have to be network-connected to the workers on a Spark cluster, because the Spark-Kernel acts as each application's proxy
> * Work has started on enabling the Spark-Kernel to support languages in addition to Scala, namely Python (with PySpark), R (with SparkR), and SQL (with SparkSQL)
>
> == Background & Rationale ==
> Apache Spark provides applications with a fast and general-purpose distributed computing engine that supports static and streaming data, tabular and graph representations of data, and an extensive set of machine learning libraries. Consequently, a wide variety of applications will be written for Spark: interactive applications that require relatively frequent function evaluations, and batch-oriented applications that require one-shot or only occasional evaluation.
>
> Apache Spark provides two mechanisms for applications to connect with Spark. The primary mechanism launches applications on Spark clusters using spark-submit (http://spark.apache.org/docs/latest/submitting-applications.html); this requires developers to bundle their application code plus any dependencies into JAR files and then submit them to Spark. The second mechanism is an ODBC/JDBC API (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine), which enables applications to issue SQL queries against SparkSQL.
>
> Our experience when developing interactive applications, such as analytic applications and Jupyter Notebooks, to run against Spark was that the spark-submit mechanism was overly cumbersome and slow (requiring JAR creation and forking processes to run spark-submit), and that the SQL interface was too limiting and did not offer easy access to components other than SparkSQL, such as streaming. The most promising mechanism provided by Apache Spark was the command-line shell (http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell), which enabled us to execute code snippets and dynamically control the tasks submitted to a Spark cluster. Spark does not provide the command-line shell as a consumable service, but it gave us the starting point from which we developed the Spark-Kernel.
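> For illustration, the interaction model described above might look like the following from an application's point of view. This is a minimal sketch only; the client API shown here (SparkKernelClient, connect, execute, onResult, onStream) is hypothetical and does not document the actual Spark-Kernel client interface:
>
>     // Hypothetical Scala sketch of an application driving a remote Spark-Kernel.
>     // All names below are illustrative assumptions, not the real client API.
>     val client = SparkKernelClient.connect("tcp://kernel-host:8000")
>
>     // Submit a code snippet for execution on the Spark cluster. Unlike
>     // spark-submit, no JAR packaging or process forking is needed, and the
>     // application needs no network path to the Spark workers, only to the kernel.
>     val execution = client.execute("""sc.parallelize(1 to 1000).sum()""")
>
>     // Results and streamed output are delivered back to the calling application.
>     execution.onResult(result => println(s"sum = $result"))
>     execution.onStream(chunk => print(chunk))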
> == Current Status ==
> The Spark-Kernel was first developed by a small team working on an internal IBM Spark-related project in July 2014. In recognition of its likely general utility to Spark users and developers, the Spark-Kernel project was moved to GitHub in November 2014 and made available under the Apache License v2.
>
> == Meritocracy ==
> The current developers are familiar with the meritocratic open source development process at Apache. As the project has gathered interest on GitHub, the developers have actively started a process to invite additional developers into the project, and we have at least one new developer who is ready to contribute code to the project.
>
> == Community ==
> We started building a community around the Spark-Kernel project when we moved it to GitHub about one year ago. Since then we have grown to about 70 people, and there are regular requests and suggestions from the community. We believe that providing Apache Spark application developers with a general-purpose and interactive API holds a lot of community potential, especially considering possible tie-ins with the Jupyter and data science communities.
>
> == Core Developers ==
> The core developers of the project are currently all from IBM, from the IBM Emerging Technology team and from IBM's recently formed Spark Technology Center.
>
> == Alignment ==
> Apache, as the home of Apache Spark, is the most natural home for the Spark-Kernel project because the Spark-Kernel was designed to work with Apache Spark and to provide capabilities for interactive applications and data science tools that are not provided by Spark itself.
>
> The Spark-Kernel also has an affinity with Jupyter (jupyter.org) because it uses the Jupyter protocol for communications, and so Jupyter Notebooks can directly use the Spark-Kernel as a kernel for communicating with Apache Spark (a kernelspec sketch appears below). However, we believe that the Spark-Kernel provides a general-purpose mechanism enabling a wider variety of applications than just Notebooks to access Spark, and so the Spark-Kernel's greatest affinity is with Apache and Apache Spark.
>
> == Known Risks ==
> === Orphaned products ===
> We believe the Spark-Kernel project has a low risk of abandonment because several parties have an interest in its continued existence. More specifically, the Spark-Kernel provides a capability that is not provided by Apache Spark today, and it enables a wider range of applications to leverage Spark. For example, IBM uses (and is considering using) the Spark-Kernel in several offerings, including its IBM Analytics for Apache Spark product in the Bluemix Cloud. A couple of other commercial users are also using it, or considering its use, in their offerings. Furthermore, Jupyter Notebooks are used by data scientists, and Spark is gaining popularity as an analytic engine among them. Jupyter Notebooks are very easily enabled with the Spark-Kernel, and so they form another constituency for it.
>
> === Inexperience with Open Source ===
> The Spark-Kernel project has been running as an open-source project (albeit with only IBM committers) for the past several months. The project has an active issue tracker, and in response to the interest indicated by the nature and volume of requests and comments, the team has publicly stated that it is beginning to build a process for accepting third-party contributions to the project.
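> As a concrete illustration of the Jupyter affinity noted under Alignment, a notebook server can be pointed at the Spark-Kernel through an ordinary Jupyter kernelspec, a kernel.json file placed in one of Jupyter's kernels directories. The sketch below assumes a hypothetical install path and launcher flag (/opt/spark-kernel/bin/spark-kernel and --profile); {connection_file} is the standard Jupyter kernelspec placeholder that the notebook server fills in at launch:
>
>     {
>       "display_name": "Spark-Kernel (Scala)",
>       "language": "scala",
>       "argv": ["/opt/spark-kernel/bin/spark-kernel", "--profile", "{connection_file}"]
>     }
>
> With such a kernelspec installed, the Spark-Kernel appears in the notebook's kernel list, and the notebook communicates with it over the standard Jupyter protocol.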
> === Relationships with Other Apache Products ===
> The Spark-Kernel has a clear affinity with the Apache Spark project because it is designed to provide capabilities for interactive applications and data science tools that are not provided by Spark itself. The Spark-Kernel can also serve as a back-end for the Zeppelin project currently incubating at Apache; there is interest from the Spark-Kernel community in developing this capability, and an experimental branch has been started.
>
> === Homogeneous Developers ===
> The current group of developers working on the Spark-Kernel are all from IBM, although the group is in the process of expanding its membership to include members of the GitHub community who are not from IBM and who have been active in the Spark-Kernel community on GitHub.
>
> === Reliance on Salaried Developers ===
> The initial committers are full-time employees at IBM, although not all of them work on the project full-time.
>
> === Excessive Fascination with the Apache Brand ===
> We believe the Spark-Kernel benefits Apache Spark application developers, and we are interested in an Apache Spark-Kernel project to benefit these developers by engaging a larger community, facilitating closer ties with the existing Spark project, and, yes, gaining more visibility for the Spark-Kernel as a solution.
>
> We have recently become aware that the project name "Spark-Kernel" may be interpreted as having an association with an Apache project. If the project is accepted by Apache, we suggest the project name remain the same; otherwise we will change it to one that does not imply any Apache association.
>
> == Documentation ==
> Comprehensive documentation, including a "Getting Started" guide, API specifications, and a roadmap, is available from the GitHub project; see https://github.com/ibm-et/spark-kernel/wiki.
>
> == Initial Source ==
> The source code resides at https://github.com/ibm-et/spark-kernel.
>
> == External Dependencies ==
> The Spark-Kernel depends upon a number of Apache projects:
> * Spark
> * Hadoop
> * Ivy
> * Commons
>
> The Spark-Kernel also depends upon a number of other open source projects:
> * JeroMQ (LGPL with Static Linking Exception, http://zeromq.org/area:licensing)
> * Akka (Apache v2)
> * JOpt Simple (MIT)
> * Spring Framework Core (Apache v2)
> * Play (Apache v2)
> * SLF4J (MIT)
> * Scala
> * Scalatest (Apache v2)
> * Scalactic (Apache v2)
> * Mockito (MIT)
>
> == Required Resources ==
> Developer and user mailing lists:
> * priv...@spark-kernel.incubator.apache.org (with moderated subscriptions)
> * comm...@spark-kernel.incubator.apache.org
> * d...@spark-kernel.incubator.apache.org
> * us...@spark-kernel.incubator.apache.org
>
> A git repository: https://git-wip-us.apache.org/repos/asf/incubator-spark-kernel.git
>
> A JIRA issue tracker: https://issues.apache.org/jira/browse/SPARK-KERNEL
>
> == Initial Committers ==
> The initial list of committers is:
> * Leugim Bustelo (g...@bustelos.com)
> * Jakob Odersky (joder...@gmail.com)
> * Luciano Resende (lrese...@apache.org)
> * Robert Senkbeil (chip.senkb...@gmail.com)
> * Corey Stubbs (cas5...@gmail.com)
> * Miao Wang (wm...@hotmail.com)
> * Sean Welleck (welle...@gmail.com)
>
> === Affiliations ===
> All of the initial committers are employed by IBM.
>
> == Sponsors ==
> === Champion ===
> * Sam Ruby (IBM)
>
> === Nominated Mentors ===
> * Luciano Resende
>
> We wish to recruit additional mentors during incubation.
>
> === Sponsoring Entity ===
> The Apache Incubator.