Re: [PROPOSAL] Apache Spark for the Incubator

Konstantin Boudnik Fri, 28 Jun 2013 22:42:40 -0700

That makes sense. Thanks for the update - I am still catching up on my emails
backed up because of the Hadoop summit.


Cos

On Tue, Jun 04, 2013 at 01:44AM, Mattmann, Chris A (398J) wrote:
> Dear Konstantin,
> 
> Thanks! The incoming Spark project is excited about the relationship
> with Bigtop that could happen here.
> 
> As for new committers, after conferring with the Spark project
> members, we would like to adopt a simple policy of having all new
> committers not add themselves to the wiki as of yet, but simply
> join the project mailing lists when they are created, and then from
> there, contribute. I and other mentors, and the Spark community are
> committed to being inclusive, so hopefully won't take too long for
> anybody to become a PPMC member/committer on the project after some
> demonstrated contributions.
> 
> Thanks for your interest and again for your kind words.
> 
> Cheers!
> 
> Chris
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Konstantin Boudnik <c...@apache.org>
> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
> Date: Friday, May 31, 2013 12:29 PM
> To: "general@incubator.apache.org" <general@incubator.apache.org>
> Subject: Re: [PROPOSAL] Apache Spark for the Incubator
> 
> >Great news!
> >
> >Definitely +1 (non-binding, I guess) on adding Spark to the family
> >of ASF projects!
> >
> >I also express the interest to contribute to the project and move it
> >forward to the graduation! Bigtop has been packaging and providing Spark
> >as a part of Hadoop 1.x software stacks for some time; and hopefully would
> >be able to offer it as a part of Hadoop 2.x line in the coming days.
> >
> >Dr. Konstantin Boudnik
> >  Hadoop committer
> >  BigTop PMC
> >
> >On Fri, May 31, 2013 at 06:03PM, Mattmann, Chris A (398J) wrote:
> >> Hi Folks,
> >> 
> >> I'm pleased to bring you a proposal to the Apache Incubator for the
> >>Apache
> >> Spark project: https://wiki.apache.org/incubator/SparkProposal
> >> 
> >> The work originates from the Berkeley AMPLab, along with a number of
> >> industry participants and other institutions. Spark is a framework for
> >> large-scale data analysis on clusters, with a particular focus on
> >> low-latency operations. The source code is written in Scala, and provides
> >> a number of APIs and bindings in various programming languages.
> >> 
> >> The proposal text is copied to the bottom of this email. I'm going to
> >>leave
> >> this thread open for the next week for discussion. Once it's died down,
> >> I'll
> >> call an official VOTE.
> >> 
> >> Suresh, Ross G. -- heads up -- this project may be of interest to you
> >>both
> >> and would welcome you guys as additional mentors. We currently have 3
> >> mentors
> >> committed to the project, but would love to have more. People
> >>interested in
> >> contributing should declare their interest here on the general@incubator
> >> thread
> >> and those potential contributors will be discussed by the incoming Spark
> >> community.
> >> 
> >> Questions -- let's hear 'em! :)
> >> 
> >> Cheers,
> >> Chris
> >> ("Champion", incoming Apache Spark)
> >> 
> >> === Abstract ===
> >> Spark is an open source system for large-scale data analysis on
> >>clusters.
> >> 
> >> === Proposal ===
> >> Spark is an open source system for fast and flexible large-scale data
> >> analysis. Spark provides a general purpose runtime that supports
> >> low-latency execution in several forms. These include interactive
> >> exploration of very large datasets, near real-time stream processing,
> >>and
> >> ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
> >> with HDFS, HBase, Cassandra and several other storage layers, and
> >> exposes APIs in Scala, Java and Python.
> >> 
> >> === Background ===
> >> Spark started as a U.C. Berkeley research project, designed to efficiently
> >> run machine learning algorithms on large datasets. Over time, it has
> >> evolved into a general computing engine as outlined above. Spark's
> >> developer community has also grown to include additional institutions,
> >> such as universities, research labs, and corporations. Funding has been
> >> provided by various institutions including the U.S. National Science
> >> Foundation, DARPA, and a number of industry sponsors. See:
> >> https://amplab.cs.berkeley.edu/sponsors/ for full details.
> >> 
> >> === Rationale ===
> >> As the number of contributors to Spark has grown, we have sought a
> >> long-term home for the project, and we believe the Apache foundation
> >>would
> >> be a great fit. Spark is a natural fit for the Apache foundation: Spark
> >> already interoperates with several existing Apache projects (HDFS,
> >>HBase,
> >> Hive, Cassandra, Avro and Flume to name a few). The Spark team is
> >>familiar
> >> with the Apache process and subscribes to the Apache mission - the
> >> team includes multiple Apache committers already. Finally, joining
> >>Apache
> >> will help coordinate the development effort of the growing number of
> >> organizations which contribute to Spark.
> >> 
> >> == Initial Goals ==
> >> The initial goals will most likely be to move the existing codebase to
> >> Apache and integrate with the Apache development process. Furthermore,
> >>we
> >> plan for incremental development and releases in line with the Apache
> >> guidelines.
> >> 
> >> === Current Status ===
> >> == Meritocracy ==
> >> The Spark project already operates on meritocratic principles. Today,
> >> Spark has several developers and has accepted multiple major patches
> >>from
> >> outside of U.C. Berkeley. While this process has remained mostly
> >>informal
> >> (we do not have an official committer list), an implicit organization
> >> exists in which individuals who contribute major components act as
> >> maintainers for those modules. If accepted, the Spark project would
> >> include several of these participants as committers from the outset. We
> >> will work to identify all committers and PPMC members for the project
> >>and
> >> to operate under the ASF meritocratic principles.
> >> 
> >> === Community ===
> >> Acceptance into the Apache foundation would bolster the already strong
> >> user and developer community around Spark. That community includes
> >>dozens
> >> of contributors from several institutions, a meetup group with several
> >> hundred members, and an active mailing list composed of hundreds of
> >>users.
> >> 
> >> === Core Developers ===
> >> The core developers of our project are listed in our contributors and
> >> initial PPMC below. Though many are at UC Berkeley, there is a
> >> representative cross sampling of other organizations including
> >>Quantifind,
> >> Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.
> >> 
> >> 
> >> === Alignment ===
> >> Our proposed effort aligns with several ongoing BIGDATA and U.S.
> >>National
> >> priority funding interests including the NSF and its Expeditions
> >>program,
> >> and the DARPA XDATA project. Our industry partners and collaborators are
> >> well aligned with our code base.
> >> 
> >> There are also a number of related Apache projects and dependencies that
> >> will be mentioned in the Relationship with Other Apache Products section.
> >> 
> >> == Known Risks ==
> >> 
> >> === Orphaned Products ===
> >> Given the current level of investment in Spark - the risk of the project
> >> being abandoned is minimal. There are several constituents who are
> >>highly
> >> incentivized to continue development. The U.C. Berkeley AMPLab relies on
> >> Spark as a platform for a large number of long-term research projects.
> >> Several companies have built verticalized products which are tightly
> >> dependent on Spark. Other companies have devoted significant internal
> >> infrastructure investment in Spark.
> >> 
> >> === Inexperience with Open Source ===
> >> Spark has existed as a healthy open source project for several years.
> >> During that time, Matei and others have curated an open-source community
> >> successfully, attracting developers from a diverse group of companies
> >> including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel,
> >>and
> >> Webtrends. 
> >> 
> >> === Homogenous Developers ===
> >> The initial list of committers includes developers from several
> >> institutions, including Quantifind, Microsoft, Yahoo!, ClearStory Data,
> >> Bizo, Intel, and Webtrends.
> >> 
> >> === Reliance on Salaried Developers ===
> >> Like most open source projects, Spark receives substantial support
> >>from
> >> salaried developers. A large fraction of Spark development is supported
> >>by
> >> graduate students at U.C. Berkeley in the course of research degrees -
> >> this is more a "volunteer" relationship, since in most cases students
> >> contribute vastly more than is necessary to immediately support research.
> >> In addition, those working from within corporations often devote "after
> >> hours" or spare time to the project - and these come from several
> >> organizations. We will work to ensure that the project can continue to be
> >> stewarded and to move forward independent of salaried developers.
> >> 
> >> 
> >> === Relationship with Other Apache Products ===
> >> Spark inter-operates with several existing Apache products by supporting
> >> them as storage layers: Apache Cassandra, Apache HBase, and Apache
> >>Hadoop
> >> (HDFS). It also uses several Apache components internally including
> >>Apache
> >> Maven and several Apache Commons libraries. Finally, Shark (a higher
> >>layer
> >> framework built on Spark) inter-operates with Apache Hive. We will
> >>explore
> >> the relationship between Spark and Apache Gora, which also provides
> >> in-memory object storage (Champion Mattmann was the Champion for Apache
> >> Gora so we expect alignment and cross pollination between our efforts).
> >> 
> >> Spark offers an alternative computation engine to Apache Hadoop
> >> (MapReduce). Unlike MapReduce, Spark is designed for lower-latency and
> >> interactive workloads. This makes the projects complementary: many users
> >> run MapReduce and Spark side-by-side.
> >> 
> >> === An Excessive Fascination with the Apache Brand ===
> >> Spark is already a healthy and relatively well known open source
> >>project.
> >> This proposal is not for the purpose of generating publicity. Rather,
> >>the
> >> primary benefits to joining Apache are those outlined in the Rationale
> >> section.
> >> 
> >> === Documentation ===
> >> The reader will find these websites highly relevant:
> >>  * Spark website: http://spark-project.org/
> >>  * Spark documentation: http://spark-project.org/documentation/
> >>  * Issue tracking: https://spark-project.atlassian.net/
> >>  * Codebase: https://github.com/mesos/spark
> >>  * User group: https://groups.google.com/group/spark-users
> >> 
> >> == Initial Source ==
> >> The Spark codebase is currently hosted on Github:
> >> https://github.com/mesos/spark. This is the exact codebase that we would
> >> migrate to the Apache foundation.
> >> 
> >> == Source and Intellectual Property Submission Plan ==
> >> Currently, the Spark codebase is distributed under a BSD license. The
> >>vast
> >> majority of code has copyright held by the University of California.
> >>Upon
> >> entering Apache, Spark will migrate to an Apache License with all
> >> copyright assigned to the Apache Foundation. The University of
> >>California
> >> will transfer all copyright to the Apache Foundation. In certain cases
> >> where individuals hold copyright, we will have individuals sign over
> >> copyright to the Apache foundation as well.
> >> 
> >> Going forward, all commits would assign copyright directly to the Apache
> >> foundation through our signed Individual Contributor License Agreements
> >> for all initial committers on the project.
> >> 
> >> 
> >> == External Dependencies ==
> >> To the best of our knowledge, all dependencies of Spark are distributed
> >> under Apache compatible licenses. Upon acceptance to the incubator, we
> >> would begin a thorough analysis of all transitive dependencies to verify
> >> this fact and introduce license checking into the build and release
> >> process (for instance integrating Apache Rat).
> >> 
> >> == Required Resources ==
> >> === Mailing list ===
> >> We will migrate the existing Spark mailing lists as follows:
> >> 
> >>  * spark-users@googlegroups --> us...@spark.incubator.apache.org
> >>  * spark-developers@googlegroups --> d...@spark.incubator.apache.org
> >>  * spark-commits are hosted on Github, so we would request
> >> comm...@spark.incubator.apache.org
> >> 
> >> The latter is to be consistent with the new PIAO naming scheme for
> >> podlings.
> >> 
> >> === Source control ===
> >> The Spark team would like to use Git for source control, due to our
> >> current use of Git.
> >> We request a writeable Git repo for Spark, and mirroring to be set up to
> >> Github through INFRA. Champion Mattmann can assist with creating INFRA
> >> tickets for this.
> >> 
> >> === Issue Tracking ===
> >> Spark currently uses a hosted JIRA deployment for issue tracking. We
> >>will
> >> migrate to the Apache JIRA.
> >> http://issues.apache.org/jira/browse/SPARK
> >> 
> >> == Initial Committers ==
> >>  * Matei Zaharia <ma...@apache.org>
> >>  * Ankur Dave <ankurd...@gmail.com>
> >>  * Tathagata Das <t...@eecs.berkeley.edu>
> >>  * Haoyuan Li <haoy...@cs.berkeley.edu>
> >>  * Josh Rosen <joshro...@cs.berkeley.edu>
> >>  * Reynold Xin <r...@cs.berkeley.edu>
> >>  * Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
> >>  * Mosharaf Chowdhury <mosha...@cs.berkeley.edu>
> >>  * Charles Reiss <char...@eecs.berkeley.edu>
> >>  * Andy Konwinski <andykonwin...@gmail.com>
> >>  * Patrick Wendell <pwend...@eecs.berkeley.edu>
> >>  * Imran Rashid <im...@quantifind.com>
> >>  * Ryan LeCompte <lecom...@gmail.com>
> >>  * Ravi Pandya <ra...@exchange.microsoft.com>
> >>  * Ram Sriharsha <harsh...@yahoo-inc.com>
> >>  * Robert Evans <ev...@yahoo-inc.com>
> >>  * Mridul Muralidharan <mrid...@yahoo-inc.com>
> >>  * Thomas Dudziak <to...@clearstorydata.com>
> >>  * Mark Hamstra <m...@clearstorydata.com>
> >>  * Stephen Haberman <stephen.haber...@gmail.com>
> >>  * Shane Huang <shannie.hu...@gmail.com>
> >>  * Andrew Xia <xiajunl...@gmail.com>
> >>  * Nick Pentreath <nick.pentre...@gmail.com>
> >>  * Sean McNamara <sean.mcnam...@webtrends.com>
> >> 
> >> == Affiliations ==
> >> The initial committers are from nine organizations: UC Berkeley,
> >> Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Mxit and
> >> Webtrends.
> >> 
> >>  * Matei Zaharia (UCB)
> >>  * Ankur Dave (UCB)
> >>  * Tathagata Das (UCB)
> >>  * Haoyuan Li (UCB)
> >>  * Josh Rosen (UCB)
> >>  * Reynold Xin (UCB)
> >>  * Shivaram Venkataraman (UCB)
> >>  * Mosharaf Chowdhury (UCB)
> >>  * Charles Reiss (UCB)
> >>  * Andy Konwinski (UCB)
> >>  * Patrick Wendell (UCB)
> >>  * Imran Rashid (Quantifind)
> >>  * Ryan LeCompte (Quantifind)
> >>  * Ravi Pandya (Microsoft)
> >>  * Ram Sriharsha (Yahoo!)
> >>  * Robert Evans (Yahoo!)
> >>  * Mridul Muralidharan (Yahoo!)
> >>  * Thomas Dudziak (ClearStory)
> >>  * Mark Hamstra (ClearStory)
> >>  * Stephen Haberman (Bizo)
> >>  * Shane Huang (Intel)
> >>  * Andrew Xia (Intel)
> >>  * Nick Pentreath (Mxit)
> >>  * Sean McNamara (Webtrends)
> >> 
> >> == Sponsors ==
> >> === Champion ===
> >>  * Chris Mattmann
> >> 
> >> === Nominated Mentors ===
> >>  * Chris Mattmann
> >>  * Paul Ramirez 
> >>  * Andrew Hart 
> >> 
> >> === Sponsoring Entity ===
> >>  The Apache Incubator
> >> 
> >> 
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >> For additional commands, e-mail: general-h...@incubator.apache.org
> >> 
> >
>
