Hi Taylor, I don't know the Spark community's opinion on the "outright vs
subproject" issue, although I have told a couple of people in that community
about the proposal and have posted an FYI to the spark-dev list. From a
technical perspective, Spark-Kernel mainly uses public Spark APIs (except for
some SparkR usage, see
https://github.com/ibm-et/spark-kernel/blob/master/sparkr-interpreter/src/main/resources/README.md),
and so I guess the answer could go either way depending on the Spark community.
Thanks,
David

> On November 12, 2015 at 8:05 PM "P. Taylor Goetz" <ptgo...@gmail.com> wrote:
>
>
> Just a quick (or maybe not :) ) question...
>
> Given the tight coupling to the Apache Spark project, were there any
> considerations or discussions with the Spark community regarding including the
> Spark-Kernel functionality outright in Spark, or the possibility of becoming a
> subproject?
>
> I'm just curious. I don't think an answer one way or another would necessarily
> block incubation.
>
> -Taylor
>
> > On Nov 12, 2015, at 7:17 PM, da...@fallside.com wrote:
> >
> > Hello, we would like to start a discussion on accepting the Spark-Kernel,
> > a mechanism for applications to interactively and remotely access Apache
> > Spark, into the Apache Incubator.
> >
> > The proposal is available online at
> > https://wiki.apache.org/incubator/SparkKernelProposal, and it is appended
> > to this email.
> >
> > We are looking for additional mentors to help with this project, and we
> > would much appreciate your guidance and advice.
> >
> > Thank you in advance,
> > David Fallside
> >
> >
> >
> > = Spark-Kernel Proposal =
> >
> > == Abstract ==
> > Spark-Kernel provides applications with a mechanism to interactively and
> > remotely access Apache Spark.
> >
> > == Proposal ==
> > The Spark-Kernel enables interactive applications to access Apache Spark
> > clusters. More specifically:
> > * Applications can send code snippets and libraries for execution by Spark
> > * Applications can be deployed separately from Spark clusters and
> > communicate with the Spark-Kernel using the provided Spark-Kernel client
> > (see the sketch after this list)
> > * Execution results and streaming data can be sent back to calling
> > applications
> > * Applications no longer have to be network-connected to the workers on a
> > Spark cluster because the Spark-Kernel acts as each application’s proxy
> > * Work has started on enabling Spark-Kernel to support languages in
> > addition to Scala, namely Python (with PySpark), R (with SparkR), and SQL
> > (with SparkSQL)
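> >
> > As a brief illustration of the client model referenced above, the
> > following Scala sketch shows an application submitting a snippet through
> > the Spark-Kernel client. The API names here (KernelClient, connect,
> > execute, onResult) are illustrative assumptions, not the project's final
> > API:
> >
> >   // Hypothetical Spark-Kernel client API; actual names may differ.
> >   // Connect to a remote kernel that proxies a Spark cluster.
> >   val client = KernelClient.connect("tcp://kernel-host:8888")
> >
> >   // Send a Scala snippet for execution on the cluster.
> >   val execution = client.execute("""
> >     val rdd = sc.parallelize(1 to 1000)
> >     rdd.sum()
> >   """)
> >
> >   // Results stream back to the calling application.
> >   execution.onResult(result => println(s"Result: $result"))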
> >
> > == Background & Rationale ==
> > Apache Spark provides applications with a fast and general-purpose
> > distributed computing engine that supports static and streaming data,
> > tabular and graph representations of data, and an extensive set of
> > machine learning libraries. Consequently, a wide variety of applications
> > will be written for Spark, ranging from interactive applications that
> > require relatively frequent function evaluations to batch-oriented
> > applications that require one-shot or only occasional evaluation.
> >
> > Apache Spark provides two mechanisms for applications to connect with
> > Spark. The primary mechanism launches applications on Spark clusters using
> > spark-submit
> > (http://spark.apache.org/docs/latest/submitting-applications.html); this
> > requires developers to bundle their application code plus any dependencies
> > into JAR files, and then submit them to Spark. A second mechanism is an
> > ODBC/JDBC API
> > (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine)
> > which enables applications to issue SQL queries against SparkSQL.
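> >
> > For reference, the spark-submit path typically looks like the following
> > invocation, which illustrates the JAR packaging step mentioned above
> > (the application class and JAR names here are hypothetical):
> >
> >   spark-submit --class com.example.MyAnalyticsApp \
> >     --master spark://master-host:7077 \
> >     my-analytics-app-assembly.jar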
> >
> > Our experience developing interactive applications that run against
> > Spark, such as analytic applications and Jupyter Notebooks, was that the
> > spark-submit mechanism was overly cumbersome and slow (requiring JAR
> > creation and forked processes to run spark-submit), and that the SQL
> > interface was too limiting, offering no easy access to components other
> > than SparkSQL, such as streaming. The most promising mechanism provided
> > by Apache Spark was the command-line shell
> > (http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell),
> > which enabled us to execute code snippets and dynamically control the
> > tasks submitted to a Spark cluster. Spark does not provide the
> > command-line shell as a consumable service, but it provided us with the
> > starting point from which we developed the Spark-Kernel.
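> >
> > For example, the shell lets a developer evaluate snippets such as the
> > following immediately against a cluster ("sc" being the shell's
> > predefined SparkContext); the Spark-Kernel exposes this same
> > snippet-level execution as a remotely accessible service:
> >
> >   scala> val lines = sc.textFile("README.md")
> >   scala> lines.filter(_.contains("Spark")).count()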
> >
> > == Current Status ==
> > Spark-Kernel was first developed by a small team working on an
> > internal IBM Spark-related project in July 2014. In recognition of its
> > likely general utility to Spark users and developers, in November 2014 the
> > Spark-Kernel project was moved to GitHub and made available under the
> > Apache License V2.
> >
> > == Meritocracy ==
> > The current developers are familiar with the meritocratic open source
> > development process at Apache. As the project has gathered interest on
> > GitHub, the developers have actively started a process to invite additional
> > developers into the project, and we have at least one new developer who is
> > ready to contribute code.
> >
> > == Community ==
> > We started building a community around the Spark-Kernel project when we
> > moved it to GitHub about one year ago. Since then the community has grown
> > to about 70 people, and we receive regular requests and suggestions from
> > its members. We believe that providing Apache Spark application developers
> > with a general-purpose and interactive API holds a lot of community
> > potential, especially considering possible tie-ins with the Jupyter and
> > data science communities.
> >
> > == Core Developers ==
> > The core developers of the project are currently all from IBM, from the
> > IBM Emerging Technology team and from IBM’s recently formed Spark
> > Technology Center.
> >
> > == Alignment ==
> > Apache, as the home of Apache Spark, is the most natural home for the
> > Spark-Kernel project because the Spark-Kernel was designed to work with
> > Apache Spark and to provide capabilities for interactive applications and
> > data science tools not provided by Spark itself.
> >
> > The Spark-Kernel also has an affinity with Jupyter (jupyter.org) because
> > it uses the Jupyter protocol for communications, and so Jupyter Notebooks
> > can directly use the Spark-Kernel as a kernel for communicating with
> > Apache Spark. However, we believe that the Spark-Kernel provides a
> > general-purpose mechanism enabling a wider variety of applications than
> > just Notebooks to access Spark, and so the Spark-Kernel’s greatest
> > affinity is with Apache and Apache Spark.
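> >
> > As a concrete example of that protocol affinity, executing a snippet via
> > the Jupyter protocol amounts to sending an "execute_request" message
> > over ZeroMQ; a simplified form of that message (fields abbreviated) is:
> >
> >   { "header":  { "msg_id": "...", "msg_type": "execute_request" },
> >     "content": { "code": "sc.parallelize(1 to 1000).sum()",
> >                  "silent": false } }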
> >
> > == Known Risks ==
> > === Orphaned products ===
> > We believe the Spark-Kernel project has a low risk of abandonment because
> > several parties have an interest in its continued existence. More
> > specifically, the Spark-Kernel provides a capability that is not provided
> > by Apache Spark today and that enables a wider range of applications to
> > leverage Spark. For example, IBM uses, or is considering using, the
> > Spark-Kernel in several offerings, including its IBM Analytics for Apache
> > Spark product in the Bluemix Cloud. A couple of other commercial users
> > are also using it, or considering its use, in their offerings.
> > Furthermore, Jupyter Notebooks are used by data scientists, and Spark is
> > gaining popularity among them as an analytic engine. Jupyter Notebooks
> > are very easily enabled with the Spark-Kernel, providing another
> > constituency for it.
> >
> > === Inexperience with Open Source ===
> > The Spark-Kernel project has been running as an open-source project
> > (albeit with only IBM committers) for the past several months. The project
> > has an active issue tracker, and due to the interest indicated by the
> > nature and volume of requests and comments, the team has publicly stated
> > that it is beginning to build a process for accepting third-party
> > contributions to the project.
> >
> > === Relationships with Other Apache Products ===
> > The Spark-Kernel has a clear affinity with the Apache Spark project
> > because it is designed to provide capabilities for interactive
> > applications and data science tools not provided by Spark itself. The
> > Spark-Kernel can serve as a back-end for the Zeppelin project currently
> > incubating at Apache. There is interest within the Spark-Kernel community
> > in developing this capability, and an experimental branch has been started.
> >
> > === Homogeneous Developers ===
> > The current group of developers working on Spark-Kernel is all from IBM,
> > although the group is in the process of expanding its membership to
> > include members of the GitHub community who are not from IBM and who have
> > been active in the Spark-Kernel community on GitHub.
> >
> > === Reliance on Salaried Developers ===
> > The initial committers are full-time employees at IBM, although not all
> > work on the project full-time.
> >
> > === Excessive Fascination with the Apache Brand ===
> > We believe the Spark-Kernel benefits Apache Spark application developers,
> > and we are interested in an Apache Spark-Kernel project to benefit these
> > developers by engaging a larger community, facilitating closer ties with
> > the existing Spark project, and yes, gaining more visibility for the
> > Spark-Kernel as a solution.
> >
> > We have recently become aware that the project name “Spark-Kernel” may be
> > interpreted as having an association with an Apache project. If the
> > project is accepted by Apache, we suggest the project name remain the
> > same; otherwise we will change it to one that does not imply any
> > Apache association.
> >
> > === Documentation ===
> > Comprehensive documentation, including a “Getting Started” guide, API
> > specifications, and a roadmap, is available from the GitHub project; see
> > https://github.com/ibm-et/spark-kernel/wiki.
> >
> > === Initial Source ===
> > The source code resides at https://github.com/ibm-et/spark-kernel.
> >
> > === External Dependencies ===
> > The Spark-Kernel depends upon a number of Apache projects:
> > * Spark
> > * Hadoop
> > * Ivy
> > * Commons
> >
> > The Spark-Kernel also depends upon a number of other open source projects:
> > * JeroMQ (LGPL with Static Linking Exception,
> > http://zeromq.org/area:licensing)
> > * Akka (Apache v2)
> > * JOpt Simple (MIT)
> > * Spring Framework Core (Apache v2)
> > * Play (Apache v2)
> > * SLF4J (MIT)
> > * Scala (BSD 3-Clause)
> > * Scalatest (Apache v2)
> > * Scalactic (Apache v2)
> > * Mockito (MIT)
> >
> > == Required Resources ==
> > Developer and user mailing lists
> > * priv...@spark-kernel.incubator.apache.org (with moderated subscriptions)
> > * comm...@spark-kernel.incubator.apache.org
> > * d...@spark-kernel.incubator.apache.org
> > * us...@spark-kernel.incubator.apache.org
> >
> > A git repository:
> > https://git-wip-us.apache.org/repos/asf/incubator-spark-kernel.git
> >
> > A JIRA issue tracker: https://issues.apache.org/jira/browse/SPARK-KERNEL
> >
> > == Initial Committers ==
> > The initial list of committers is:
> > * Leugim Bustelo (g...@bustelos.com)
> > * Jakob Odersky (joder...@gmail.com)
> > * Luciano Resende (lrese...@apache.org)
> > * Robert Senkbeil (chip.senkb...@gmail.com)
> > * Corey Stubbs (cas5...@gmail.com)
> > * Miao Wang (wm...@hotmail.com)
> > * Sean Welleck (welle...@gmail.com)
> >
> > === Affiliations ===
> > All of the initial committers are employed by IBM.
> >
> > == Sponsors ==
> > === Champion ===
> > * Sam Ruby (IBM)
> >
> > === Nominated Mentors ===
> > * Luciano Resende
> >
> > We wish to recruit additional mentors during incubation.
> >
> > === Sponsoring Entity ===
> > The Apache Incubator.