Hi Taylor,

I don't know the Spark community's opinion on the "outright vs. subproject" question, although I have told a couple of people in that community about the proposal and have posted an FYI to the spark-dev list. From a technical perspective, the Spark-Kernel mainly uses public Spark APIs (except for some SparkR usage; see https://github.com/ibm-et/spark-kernel/blob/master/sparkr-interpreter/src/main/resources/README.md), so I guess the answer could go either way depending on the Spark community.

Thanks,
David
> On November 12, 2015 at 8:05 PM "P. Taylor Goetz" <ptgo...@gmail.com> wrote:
>
> Just a quick (or maybe not :) ) question...
>
> Given the tight coupling to the Apache Spark project, were there any considerations or discussions with the Spark community regarding including the Spark-Kernel functionality outright in Spark, or the possibility of becoming a subproject?
>
> I'm just curious. I don't think an answer one way or another would necessarily block incubation.
>
> -Taylor
>
> On Nov 12, 2015, at 7:17 PM, da...@fallside.com wrote:
> >
> > Hello, we would like to start a discussion on accepting the Spark-Kernel, a mechanism for applications to interactively and remotely access Apache Spark, into the Apache Incubator.
> >
> > The proposal is available online at https://wiki.apache.org/incubator/SparkKernelProposal, and it is appended to this email.
> >
> > We are looking for additional mentors to help with this project, and we would much appreciate your guidance and advice.
> >
> > Thank you in advance,
> > David Fallside
> >
> >
> > = Spark-Kernel Proposal =
> >
> > == Abstract ==
> > The Spark-Kernel provides applications with a mechanism to interactively and remotely access Apache Spark.
> >
> > == Proposal ==
> > The Spark-Kernel enables interactive applications to access Apache Spark clusters. More specifically:
> > * Applications can send code snippets and libraries for execution by Spark
> > * Applications can be deployed separately from Spark clusters and communicate with the Spark-Kernel using the provided Spark-Kernel client
> > * Execution results and streaming data can be sent back to calling applications
> > * Applications no longer have to be network-connected to the workers on a Spark cluster, because the Spark-Kernel acts as each application's proxy
> > * Work has started on enabling the Spark-Kernel to support languages in addition to Scala, namely Python (with PySpark), R (with SparkR), and SQL (with SparkSQL)
> >
> > == Background & Rationale ==
> > Apache Spark provides applications with a fast and general-purpose distributed computing engine that supports static and streaming data, tabular and graph representations of data, and an extensive set of machine learning libraries. Consequently, a wide variety of applications will be written for Spark: interactive applications that require relatively frequent function evaluations, and batch-oriented applications that require one-shot or only occasional evaluation.
> >
> > Apache Spark provides two mechanisms for applications to connect with it. The primary mechanism launches applications on Spark clusters using spark-submit (http://spark.apache.org/docs/latest/submitting-applications.html); this requires developers to bundle their application code plus any dependencies into JAR files, and then submit them to Spark. A second mechanism is an ODBC/JDBC API (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine) which enables applications to issue SQL queries against SparkSQL.
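> > To make the second mechanism concrete, here is a minimal, illustrative sketch of an application querying the Spark SQL Thrift JDBC server from Scala. It is not taken from the Spark-Kernel codebase: the host, port, database, credentials, and table name are assumptions for a default local deployment, and the standard Hive JDBC driver must be on the classpath.
> >
> >     import java.sql.DriverManager
> >
> >     object ThriftJdbcSketch {
> >       def main(args: Array[String]): Unit = {
> >         // The Thrift server speaks the HiveServer2 protocol, so the
> >         // ordinary Hive JDBC driver is used to reach SparkSQL.
> >         Class.forName("org.apache.hive.jdbc.HiveDriver")
> >         val conn = DriverManager.getConnection(
> >           "jdbc:hive2://localhost:10000/default", "user", "")
> >         try {
> >           // "some_table" is a hypothetical registered table.
> >           val rs = conn.createStatement().executeQuery(
> >             "SELECT COUNT(*) FROM some_table")
> >           while (rs.next()) println(rs.getLong(1))
> >         } finally {
> >           conn.close()
> >         }
> >       }
> >     }
> >
> > As described next, this path only reaches SparkSQL; general Spark programming (RDDs, streaming, and so on) is outside its scope.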
> > Our experience developing interactive applications to run against Spark, such as analytic applications and Jupyter Notebooks, was that the spark-submit mechanism was overly cumbersome and slow (requiring JAR creation and forking processes to run spark-submit), and that the SQL interface was too limiting and did not offer easy access to components other than SparkSQL, such as streaming. The most promising mechanism provided by Apache Spark was the command-line shell (http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell), which enabled us to execute code snippets and dynamically control the tasks submitted to a Spark cluster. Spark does not provide the command-line shell as a consumable service, but it provided us with the starting point from which we developed the Spark-Kernel.
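> > For illustration, the kind of snippet involved is ordinary interactive Spark code such as one would type into spark-shell, where the shell (or kernel) supplies the SparkContext as `sc`; the numbers below are made up:
> >
> >     // Evaluated incrementally against a live cluster; no JAR, no spark-submit.
> >     val rdd = sc.parallelize(1 to 1000000)
> >     val evens = rdd.filter(_ % 2 == 0)
> >     println(evens.count())  // 500000
> >
> > The Spark-Kernel's aim is to accept snippets like this over the network from a separately deployed application, rather than from a local console.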
> > == Current Status ==
> > The Spark-Kernel was first developed by a small team working on an internal IBM Spark-related project in July 2014. In recognition of its likely general utility to Spark users and developers, the Spark-Kernel project was moved to GitHub in November 2014 and made available under the Apache License v2.
> >
> > == Meritocracy ==
> > The current developers are familiar with the meritocratic open-source development process at Apache. As the project has gathered interest on GitHub, the developers have started a process to invite additional developers into the project, and at least one new developer is ready to contribute code.
> >
> > == Community ==
> > We started building a community around the Spark-Kernel project when we moved it to GitHub about one year ago. Since then the community has grown to about 70 people, and there are regular requests and suggestions from it. We believe that providing Apache Spark application developers with a general-purpose and interactive API holds a lot of community potential, especially considering possible tie-ins with the Jupyter and data science communities.
> >
> > == Core Developers ==
> > The core developers of the project are currently all from IBM: from the IBM Emerging Technology team and from IBM's recently formed Spark Technology Center.
> >
> > == Alignment ==
> > Apache, as the home of Apache Spark, is the most natural home for the Spark-Kernel project because the Spark-Kernel was designed to work with Apache Spark and to provide capabilities for interactive applications and data science tools that Spark itself does not provide.
> >
> > The Spark-Kernel also has an affinity with Jupyter (jupyter.org) because it uses the Jupyter protocol for communications, and so Jupyter Notebooks can directly use the Spark-Kernel as a kernel for communicating with Apache Spark. However, we believe that the Spark-Kernel provides a general-purpose mechanism enabling a wider variety of applications than just Notebooks to access Spark, and so the Spark-Kernel's greatest affinity is with Apache and Apache Spark.
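> > Because the kernel speaks the Jupyter protocol, any application that can send Jupyter messages over ZeroMQ can drive it, not only Notebooks. Below is a deliberately minimal, hypothetical sketch (using JeroMQ, one of the dependencies listed later) of an execute_request on a kernel's shell channel. The port is made up, the JSON is hand-rolled for brevity, and the HMAC signature frame is left empty, which a kernel accepts only when its signature key is unset; a real client reads the connection file and signs the header-through-content frames.
> >
> >     import org.zeromq.ZMQ
> >
> >     object ExecuteRequestSketch {
> >       def main(args: Array[String]): Unit = {
> >         val ctx = ZMQ.context(1)
> >         val shell = ctx.socket(ZMQ.DEALER)
> >         shell.connect("tcp://127.0.0.1:54321") // shell port from the connection file
> >
> >         // Jupyter wire format: delimiter, signature, header,
> >         // parent_header, metadata, content.
> >         val header = """{"msg_id":"1","username":"app","session":"s1",""" +
> >           """"msg_type":"execute_request","version":"5.0"}"""
> >         shell.sendMore("<IDS|MSG>")
> >         shell.sendMore("")   // empty HMAC; valid only with no signature key
> >         shell.sendMore(header)
> >         shell.sendMore("{}") // parent_header
> >         shell.sendMore("{}") // metadata
> >         shell.send("""{"code":"sc.parallelize(1 to 10).sum()","silent":false}""")
> >
> >         println(shell.recvStr()) // first frame of the kernel's reply
> >         shell.close(); ctx.term()
> >       }
> >     }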
> > == Known Risks ==
> > === Orphaned products ===
> > We believe the Spark-Kernel project has a low risk of abandonment because several parties have an interest in its continued existence. More specifically, the Spark-Kernel provides a capability that is not provided by Apache Spark today, and one that enables a wider range of applications to leverage Spark. For example, IBM uses, or is considering using, the Spark-Kernel in several offerings, including its IBM Analytics for Apache Spark product in the Bluemix Cloud. A couple of other commercial users are likewise using it, or considering its use, in their offerings. Furthermore, Jupyter Notebooks are used by data scientists, and Spark is gaining popularity as an analytic engine for them; because Jupyter Notebooks are very easily enabled with the Spark-Kernel, they form another constituency for it.
> >
> > === Inexperience with Open Source ===
> > The Spark-Kernel project has been running as an open-source project (albeit with only IBM committers) for the past several months. The project has an active issue tracker, and given the interest indicated by the nature and volume of requests and comments, the team has publicly stated that it is building a process for accepting third-party contributions to the project.
> >
> > === Relationships with Other Apache Products ===
> > The Spark-Kernel has a clear affinity with the Apache Spark project because it is designed to provide capabilities for interactive applications and data science tools that Spark itself does not provide. The Spark-Kernel can also be a back-end for the Zeppelin project currently incubating at Apache; there is interest from the Spark-Kernel community in developing this capability, and an experimental branch has been started.
> >
> > === Homogeneous Developers ===
> > The current group of developers working on the Spark-Kernel are all from IBM, although the group is in the process of expanding its membership to include non-IBM developers who have been active in the Spark-Kernel community on GitHub.
> >
> > === Reliance on Salaried Developers ===
> > The initial committers are full-time employees at IBM, although not all work on the project full-time.
> >
> > === Excessive Fascination with the Apache Brand ===
> > We believe the Spark-Kernel benefits Apache Spark application developers, and we are interested in an Apache Spark-Kernel project to benefit these developers by engaging a larger community, facilitating closer ties with the existing Spark project, and, yes, gaining more visibility for the Spark-Kernel as a solution.
> >
> > We have recently become aware that the project name "Spark-Kernel" may be interpreted as implying an association with an Apache project. If the project is accepted by Apache, we suggest the project name remain the same; otherwise we will change it to one that does not imply any Apache association.
> >
> > === Documentation ===
> > Comprehensive documentation, including a "Getting Started" guide, API specifications, and a roadmap, is available from the GitHub project; see https://github.com/ibm-et/spark-kernel/wiki.
> >
> > === Initial Source ===
> > The source code resides at https://github.com/ibm-et/spark-kernel.
> >
> > === External Dependencies ===
> > The Spark-Kernel depends upon a number of Apache projects:
> > * Spark
> > * Hadoop
> > * Ivy
> > * Commons
> >
> > The Spark-Kernel also depends upon a number of other open-source projects:
> > * JeroMQ (LGPL with Static Linking Exception, http://zeromq.org/area:licensing)
> > * Akka (Apache v2)
> > * JOpt Simple (MIT)
> > * Spring Framework Core (Apache v2)
> > * Play (Apache v2)
> > * SLF4J (MIT)
> > * Scala
> > * Scalatest (Apache v2)
> > * Scalactic (Apache v2)
> > * Mockito (MIT)
> >
> > == Required Resources ==
> > Developer and user mailing lists:
> > * priv...@spark-kernel.incubator.apache.org (with moderated subscriptions)
> > * comm...@spark-kernel.incubator.apache.org
> > * d...@spark-kernel.incubator.apache.org
> > * us...@spark-kernel.incubator.apache.org
> >
> > A git repository: https://git-wip-us.apache.org/repos/asf/incubator-spark-kernel.git
> >
> > A JIRA issue tracker: https://issues.apache.org/jira/browse/SPARK-KERNEL
> >
> > == Initial Committers ==
> > The initial list of committers is:
> > * Leugim Bustelo (g...@bustelos.com)
> > * Jakob Odersky (joder...@gmail.com)
> > * Luciano Resende (lrese...@apache.org)
> > * Robert Senkbeil (chip.senkb...@gmail.com)
> > * Corey Stubbs (cas5...@gmail.com)
> > * Miao Wang (wm...@hotmail.com)
> > * Sean Welleck (welle...@gmail.com)
> >
> > === Affiliations ===
> > All of the initial committers are employed by IBM.
> >
> > == Sponsors ==
> > === Champion ===
> > * Sam Ruby (IBM)
> >
> > === Nominated Mentors ===
> > * Luciano Resende
> >
> > We wish to recruit additional mentors during incubation.
> >
> > === Sponsoring Entity ===
> > The Apache Incubator.