Re: [VOTE] Accept MADlib into the Apache Incubator

Frank McQuillan Thu, 10 Sep 2015 16:08:18 -0700

+1 (non-binding)

On Thu, Sep 10, 2015 at 3:57 PM, Caleb Welton <cwel...@pivotal.io> wrote:


> +1 (non-binding)
>
> On Thu, Sep 10, 2015 at 12:53 PM, Rahul Iyer <rahulri...@gmail.com> wrote:
>
> > +1 (non-binding)
> >
> > On Wed, Sep 09, 2015 at 07:37PM, Roman Shaposhnik wrote:
> > >
> > > Following the discussion earlier:
> > >    http://s.apache.org/TE6
> > >
> > > I would like to call a VOTE for accepting
> > > MADlib community as a new ASF incubator
> > > project.
> > >
> > > The proposal is available at:
> > >     https://wiki.apache.org/incubator/MADlibProposal
> > > and is also included at the bottom of this email.
> > >
> > > Vote is open until at least Mon, 14 September 2015, 23:59:00 PST
> > >
> > >  [ ] +1 accept MADlib into the Apache Incubator
> > >  [ ] ±0
> > >  [ ] -1 because...
> > >
> > > Thanks,
> > > Roman.
> > >
> > > == Abstract ==
> > > MADlib is an open-source library (licensed under 2-clause BSD license)
> > > for scalable in-database analytics. It provides data-parallel
> > > implementations of mathematical, statistical and machine learning
> > > methods for structured and unstructured data. The MADlib mission is to
> > > foster widespread development of scalable analytic skills, by
> > > harnessing efforts from commercial practice, academic research, and
> > > open source development.
> > >
> > > MADlib occupies a unique niche in the realm of data science and
> > > machine learning libraries since its SQL APIs can allow it to work on
> > > a wide range of data stores and SQL engines.
> > >
> > > == Proposal ==
> > > The current open source community behind MADlib feels that aligning
> > > itself with HAWQ's community, governance model, infrastructure and
> > > roadmap will allow the project to accelerate adoption and community
> > > growth. Given HAWQ's trajectory of entering Apache Software Foundation
> > > family as an Incubating project, we feel that the best course of
> > > action for MADlib is to follow a similar route.
> > >
> > > MADlib and HAWQ are complementary technologies in that MADlib
> > > in-database analytical functions can run within the HAWQ execution
> > > engine. (MADlib also runs on Greenplum Database and PostgreSQL today.)
> > > It is expected that contributors to MADlib will be cognizant of the
> > > HAWQ ASF project and may contribute to it as well.  In short,
> > > collaboration between the two communities will make both projects more
> > > vibrant and advance the respective technologies in potentially novel
> > > directions.
> > >
> > > Contributors may also look at the HAWQ project as a starting point for
> > > ports to other parallel database engines. This proposal highly
> > > encourages this type of work as it would help to further realize the
> > > original cross-platform goal of MADlib as envisioned by its
> > > originators.
> > >
> > > Thus, the goal of this proposal is to bring the existing MADlib open
> > > source community into ASF, change the project's governance model to
> > > the "Apache Way" and transition the project's codebase and
> > > infrastructure into ASF INFRA. The community has agreed to transfer
> > > the brand name "MADlib" to Apache Software Foundation as well.
> > >
> > > Pivotal Inc. on behalf of the MADlib open source community is
> > > submitting this proposal to transition source code and associated
> > > artifacts (documentation, web site content, wiki, etc.) to the Apache
> > > Software Foundation Incubator under the Apache License, Version 2.0
> > > and is asking the Incubator PMC to establish a MADlib incubating
> > > project.
> > >
> > > Currently MADlib uses a few category X licensed software tools during
> > > its build (mostly for generating documentation):
> > >    * doxypy 0.4.2 (GPL)
> > >    * doxygen 1.8.4 (GPL)
> > >    * TikZ-UML
> > >    * bison 2.4 (GPL, with an exception for generated output)
> > > We feel that this usage is compatible with an overall project licensed
> > > under the ALv2 and don't anticipate any changes.
> > > Our usage of LGPL library cern_root-5.34 is expected to go away since
> > > the 2 cern modules used are being entirely re-written in MADlib.
> > >
> > > Finally, MADlib's inclusion of an MPL-licensed library (eigen 3.2.2)
> > > in its binary artifact seems to be consistent with the
> > > ASF recommendation for managing "weak copyleft" dependencies.
> > >
> > >
> > > == Background ==
> > > MADlib grew out of discussions between database engine developers,
> > > data scientists, IT architects and academics interested in new
> > > approaches to scalable, sophisticated in-database analytics. These
> > > discussions were written up in a paper in VLDB 2009 that coined the
> > > term “MAD Skills” for data analysis
> > > (http://dl.acm.org/citation.cfm?id=1687576). The MADlib software
> > > project began the following year as a collaboration between
> > > researchers at UC Berkeley and engineers and data scientists at
> > > Pivotal (formerly EMC/Greenplum).
> > >
> > > The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the
> > > University of Wisconsin, and the University of Florida.  The project
> > > was publicly documented in a paper at VLDB 2012
> > > (http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf).  Today
> > > MADlib has contributors from around the world including both
> > > individuals and institutions.  For example, recent contributions have
> > > come from Pivotal, Stanford University, and the University of Illinois
> > > at Chicago.
> > >
> > > MADlib was conceived from the outset as a free, open source library
> > > for all to use and contribute to.  Since its inception, the community
> > > has steadily added new methods in the areas of mathematics,
> > > statistics, machine learning, and data transformation.  The current
> > > library includes over 30 principal algorithms as well as many
> > > additional operators and utility functions.
> > >
> > > The methods in MADlib are designed both for in- or out-of-core
> > > execution, and for the shared-nothing, scale-out parallelism offered
> > > by modern parallel database engines, ensuring that computation is done
> > > close to the data. The core functionality is written in declarative
> > > SQL statements, which orchestrate data movement to and from disk, and
> > > across networked machines. Single-node inner loops take advantage of
> > > SQL extensibility to call out to high performance math libraries in
> > > user-defined scalar and aggregate functions. At the highest level,
> > > tasks that require iteration and/or structure definition are coded in
> > > Python driver routines, which are used only to kick off the data-rich
> > > computations that happen within the database engine.
> > >
> > > The first platforms supported by MADlib were Greenplum Database and
> > > PostgreSQL.  With the development of HAWQ SQL-on-Hadoop technology by
> > > Pivotal, MADlib offers a way to perform predictive analytics on very
> > > large data sets stored on a Hadoop cluster.
> > >
> > > Today, MADlib is in active development and is deployed on a wide
> > > variety of industry and academic projects across many different
> > > verticals.
> > >
> > > == Rationale ==
> > > Enterprises today are seeing the value of landing very large
> > > quantities of data in Hadoop clusters with the goal of improving their
> > > products and processes.  With the proliferation of increasingly
> > > sophisticated SQL-on-Hadoop technologies such as HAWQ, analysts can
> > > use the familiar SQL language to query this data at scale.  This
> > > effectively opens the door to Hadoop in the enterprise.
> > >
> > > Adding SQL-based predictive analytics like MADlib to the equation
> > > enables organizations to reason across large data sets without
> > > resorting to sampling, which has been a traditional approach when
> > > confronted with scale problems.  Operating on all of the data with
> > > MADlib results in more robust and accurate models.
> > >
> > > Since MADlib is a SQL-based interface, organizations do not need to
> > > re-train their teams on an unfamiliar programming language since SQL
> > > skills are ubiquitous in today's enterprises.
> > >
> > > Given the high velocity of innovation happening in the underlying
> > > Hadoop ecosystem, any SQL-based predictive analytics technology that
> > > plays in this ecosystem must be commensurately agile to keep up with
> > > the community. We strongly believe that in the Big Data space, this
> > > can be optimally achieved through a vibrant, diverse, self-governed
> > > community collectively innovating around a single codebase while at
> > > the same time cross-pollinating with various other data management
> > > communities. Apache Software Foundation is the ideal place to meet
> > > those ambitious goals.
> > >
> > > == Initial Goals ==
> > > Our initial goals are to bring MADlib into the ASF, transition the
> > > engineering and governance processes to be in accordance with the
> > > "Apache Way" and foster a collaborative development model closely
> > > aligned with that of HAWQ.
> > >
> > > Another important goal is encouraging efforts to port to other
> > > execution engines.
> > >
> > > The MADlib project will continue developing new functionality in an
> > > open, community-driven way. We envision accelerating innovation under
> > > ASF governance, in order to meet the requirements of a wide variety of
> > > predictive analytics use cases.
> > >
> > > We will also require transitioning of existing project infrastructure
> > > (source code, JIRA, mailing list) to the ASF infrastructure.
> > >
> > > == Current Status ==
> > > Currently, the project is available at http://madlib.net/. The
> > > codebase is licensed under a 2-clause BSD license. Our current
> > > governance model could be described as a "benevolent dictator" one. As
> > > stated above, the existing MADlib community feels that closer
> > > alignment with the HAWQ community, infrastructure and governance model
> > > as proposed to the ASF will allow the MADlib project to thrive far
> > > more than it would in relative isolation from HAWQ.
> > >
> > > === Meritocracy ===
> > > Our proposed list of initial committers includes the current MADlib R&D
> > > team at Pivotal and existing active members of the open source
> > > project. This group will form a base for the broader community we will
> > > invite to collaborate on the codebase. We intend to radically expand
> > > the initial developer and user community by running the project in
> > > accordance with the "Apache Way". Users and new contributors will be
> > > treated with respect and welcomed. By participating in the community
> > > and providing quality patches/support that move the project forward,
> > > they will earn merit. They also will be encouraged to provide non-code
> > > contributions (documentation, events, community management, etc.) and
> > > will gain merit for doing so. Those with a proven support and quality
> > > track record will be encouraged to become committers.
> > >
> > > === Community ===
> > > If MADlib is accepted for incubation, the primary initial goal will be
> > > transitioning the core community towards embracing the Apache Way of
> > > project governance. We would solicit major existing contributors to
> > > become committers on the project from the start.
> > >
> > > === Core Developers ===
> > > MADlib core developers are skilled in working as part of openly
> > > governed communities. That said, most of the core developers are
> > > currently NOT affiliated with the ASF and would require new ICLAs
> > > before committing to the project.
> > >
> > > === Alignment ===
> > > The following existing ASF projects can be considered when reviewing
> > > the MADlib proposal:
> > >
> > > Apache Mahout project's goal is to build an environment for quickly
> > > creating scalable performant machine learning applications. Apache
> > > Mahout is, perhaps, the oldest machine learning library in Hadoop
> > > ecosystem. The three major components of Mahout are an environment for
> > > building scalable algorithms, many new Scala + Spark (H2O in progress)
> > > algorithms, and Mahout's mature Hadoop MapReduce algorithms. We see
> > > the two projects benefiting from each other's experience of
> > > implementing similar classes of algorithms and look forward to a
> > > fruitful exchange of ideas between the two communities.
> > >
> > > Apache Spark is a fast engine for processing large datasets, typically
> > > from a Hadoop cluster, and performing batch, streaming, interactive,
> > > or machine learning workloads.  Recently, Apache Spark has embraced
> > > SQL-like APIs around DataFrames at its core. Because of that we would
> > > expect a level of collaboration between the two projects. Spark
> > > project also contains a library (MLlib) that is the closest cousin to
> > > MADlib. MLlib is Apache Spark's scalable machine learning library. We
> > > see the two projects benefiting from each other's experience of
> > > implementing similar classes of algorithms and look forward to a
> > > fruitful exchange of ideas between the two communities.
> > >
> > > Apache Hive is a data warehouse software that facilitates querying and
> > > managing large datasets residing in distributed storage. Hive provides
> > > a mechanism to project structure onto this data and query the data
> > > using a SQL-like language called HiveQL. We see a potential for MADlib
> > > to leverage Hive as a backend the same way it currently leverages
> > > PostgreSQL-derived SQL backends. This could be especially useful for
> > > longer running algorithms.
> > >
> > > Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and
> > > Cloud Storage. We see a potential for MADlib to leverage Drill as a
> > > backend the same way it currently leverages PostgreSQL-derived SQL
> > > backends. This could be especially useful for analyzing data coming
> > > from heterogeneous sources and federated by the Drill engine.
> > >
> > > == Known Risks ==
> > > Development has been sponsored mostly by a single company (or its
> > > predecessors) thus far and coordinated mainly by the core Pivotal R&D
> > > team.
> > >
> > > So far, the project's governance model has explicitly been a
> > > "benevolent dictator" one. For the project to fully transition to the
> > > "Apache Way", development must shift towards the meritocracy-centric
> > > model of growing a community of contributors balanced with the needs
> > > for extreme stability and core implementation coherency.
> > >
> > > === Orphaned products ===
> > > The community proposing MADlib for incubation is an independent open
> > > source community. Even though Pivotal happens to be the biggest
> > > corporate sponsor of the project (by means of employing the core team)
> > > the community goes beyond those affiliated with Pivotal. On top of
> > > that, Pivotal is fully committed to maintaining its position as one of
> > > the leading providers of SQL-based analytics aimed squarely at data
> > > scientists. MADlib is the only game in town that can leverage SQL APIs
> > > ranging from traditional RDBMS technology all the way to data
> > > warehousing (Pivotal Greenplum Database) and into SQL-on-Hadoop
> > > (HAWQ). Moreover, Pivotal has a vested interest in making MADlib
> > > succeed by driving its close integration with sister ASF projects. We
> > > expect this to further reduce the risk of orphaning the product.
> > >
> > > Even in the absence of support by a particular vendor such as Pivotal,
> > > and in a worst-case scenario where HAWQ and Greenplum Database fail to
> > > gain traction in OSS, the existence of an established PostgreSQL OSS
> > > project means there will still be a working stack for MADlib.
> > >
> > > === Inexperience with Open Source ===
> > > MADlib has been an open source project from the outset. All developers
> > > working on the project (regardless of their employment affiliation)
> > > did so completely in the open. While the governance model of MADlib
> > > has been more of a benevolent dictator model, the project has always
> > > been receptive to accepting contributions from all sources and
> > > including them in future releases based on thorough code review,
> > > testing, and compliance with the project’s coding best practices.
> > >
> > > === Homogeneous Developers ===
> > > While most of the initial committers are employed by Pivotal, there's
> > > still a healthy level of interest coming from academia. On top of that
> > > we expect to spark curiosity in sister ASF projects and attract
> > > developers unaffiliated with Pivotal. Finally, MADlib is being used
> > > extensively whenever Pivotal engages with customers on data science
> > > projects. This typically means that the skills remain within a
> > > customer organization which further increases the chance of turning
> > > customer data scientists into MADlib contributors.
> > >
> > > === Reliance on Salaried Developers ===
> > > A large percentage of the contributors are paid to work in the Big
> > > Data space. While they might wander from their current employers, they
> > > are unlikely to venture far from their core expertise and thus will
> > > continue to be engaged with the project regardless of their current
> > > employers. In addition, the project is still enjoying popularity in
> > > academic circles and we hope that will help mitigate reliance on
> > > salaried developers as well.
> > >
> > > === Relationships with Other Apache Products ===
> > > As mentioned in the Alignment section, MADlib may consider various
> > > degrees of integration and code exchange with Apache Spark (MLlib),
> > > Apache Mahout, Apache Hive and Apache Drill projects. We expect
> > > integration points to be inside and outside the project. We look
> > > forward to collaborating with these communities as well as other
> > > communities under the Apache umbrella.
> > >
> > > === An Excessive Fascination with the Apache Brand ===
> > > While we intend to leverage the Apache "brand" when talking to other
> > > projects as a testament to our project’s neutrality, we have no plans
> > > for making use of the Apache brand in press releases nor posting
> > > billboards advertising acceptance of MADlib into Apache Incubator.
> > >
> > > == Documentation ==
> > > The documentation is currently available at:
> > >    https://github.com/madlib/frontpage
> > >
> > > The documentation is currently licensed under 2-clause BSD license.
> > >
> > > == Initial Source ==
> > > Initial source code is available at:
> > >    * MADlib: https://github.com/madlib/madlib
> > >    * Testsuite: https://github.com/madlib/testsuite
> > >    * Contributors: https://github.com/madlib/contrib
> > >
> > > The code is currently licensed under 2-clause BSD license.
> > >
> > > == Source and Intellectual Property Submission Plan ==
> > > As soon as MADlib is approved to join the Incubator, the source code
> > > will be transitioned via the Software Grant Agreement onto ASF
> > > infrastructure and in turn made available under the Apache License,
> > > version 2.0.  We know of no legal encumbrances that would inhibit the
> > > transfer of source code to the ASF.
> > >
> > > == External Dependencies ==
> > >
> > > Runtime dependencies:
> > >    * boost-1.47.0 (Boost Software License)
> > >    * _m_widen_init (MIT for this subcomponent of GCC)
> > >    * python-argparse-1.2.1 (PSF LICENSE AGREEMENT FOR PYTHON 2.7.1)
> > >    * pyyaml-3.10 (MIT license)
> > >    * cern_root-5.34 (LGPL, however this dependency will be removed
> > > since the 2 cern modules used are being entirely re-written in MADlib)
> > >    * eigen-3.2.2 (Mozilla Public License)
> > >    * pyxb-1.2.4 (Apache license version 2)
> > >    * python (Python Software Foundation License Version 2)
> > >    * mathjax-2.5 (Apache license version 2)
> > >
> > > Build only dependencies:
> > >    * doxypy-0.4.2 (GPL)
> > >    * cmake-2.8.4 (BSD 3-clause License)
> > >    * doxygen >= 1.8.4 (GPL)
> > >    * flex >= 2.5.33 (BSD)
> > >    * bison >= 2.4 (GPL)
> > >    * latex (LaTeX Project Public License)
> > >    * TikZ-UML (no license information)
> > >
> > > Cryptography
> > >    * N/A
> > >
> > > == Required Resources ==
> > >
> > > === Mailing lists ===
> > >   * priv...@madlib.incubator.apache.org (moderated subscriptions)
> > >   * comm...@madlib.incubator.apache.org
> > >   * d...@madlib.incubator.apache.org
> > >   * iss...@madlib.incubator.apache.org
> > >   * u...@madlib.incubator.apache.org
> > >
> > > === Git Repository ===
> > > https://git-wip-us.apache.org/repos/asf/incubator-madlib.git
> > >
> > > === Issue Tracking ===
> > > JIRA Project MADlib (MADLIB)
> > >
> > > We will also request migration of our current JIRA available at
> > > http://jira.madlib.net/
> > >
> > > === Other Resources ===
> > >
> > > Means of setting up regular builds for MADlib on builds.apache.org
> > > will require integration with Docker support.
> > >
> > > == Initial Committers ==
> > >   * Anirudh Kondaveeti
> > >   * Caleb Welton
> > >   * Frank McQuillan
> > >   * Gang Xiong
> > >   * Gautam Muralidhar
> > >   * Hitoshi Harada
> > >   * Hulya Emir-farinas
> > >   * Ian Huston
> > >   * KeeSiong Ng
> > >   * Noel Sio
> > >   * Rahul Iyer
> > >   * Rashmi Raghu
> > >   * Regunathan Radhakrishnan
> > >   * Ronert Obst
> > >   * Samuel Ziegler
> > >   * Sarah Aerni
> > >   * Srivatsan Ramanujam
> > >   * Woo Jae Jung
> > >   * Xixuan Feng
> > >   * Yu Yang
> > >   * Atri Sharma
> > >   * Greg Chase
> > >   * Chloe Jackson
> > >   * Roman Shaposhnik
> > >   * Vaibhav Gumashta
> > >   * Ted Dunning
> > >   * Konstantin Boudnik
> > >
> > > == Affiliations ==
> > >   * Hortonworks: Vaibhav Gumashta
> > >   * MapR: Ted Dunning
> > >   * WANDisco: Konstantin Boudnik
> > >   * Barclays:  Atri Sharma
> > >   * Pivotal: everyone else on this proposal
> > >
> > > == Sponsors ==
> > >
> > > === Champion ===
> > > Roman Shaposhnik
> > >
> > > === Nominated Mentors ===
> > >
> > > The initial mentors are listed below:
> > >   * Ted Dunning - Apache Member, MapR
> > >   * Konstantin Boudnik - Apache Member, WANDisco
> > >   * Roman Shaposhnik - Apache Member, Pivotal
> > >
> > > === Sponsoring Entity ===
> > > We would like to propose Apache incubator to sponsor this project.
> >
>
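For readers unfamiliar with the layering described in the Background section of the quoted proposal (a Python driver routine that only kicks off iterations, with the data-rich computation done by SQL aggregates inside the database engine), here is a minimal sketch of that pattern. It is not MADlib code and does not use any MADlib API: the standard-library sqlite3 module stands in for the database engine, and the `points` table, the `fit_slope` function and its parameters are hypothetical names chosen for illustration.

```python
import sqlite3


def fit_slope(conn, iterations=200, learning_rate=0.01):
    """Hypothetical driver routine: fit y ~ w * x by gradient descent.

    Python only holds the loop and the current coefficient; each pass
    over the data is a single SQL aggregate run by the database engine.
    """
    w = 0.0
    for _ in range(iterations):
        # Mean gradient of the squared error, computed inside the engine.
        (grad,) = conn.execute(
            "SELECT AVG(2.0 * (? * x - y) * x) FROM points", (w,)
        ).fetchone()
        w -= learning_rate * grad
    return w


# Illustrative data only: five rows that lie exactly on y = 3x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(i), 3.0 * i) for i in range(1, 6)],
)
slope = fit_slope(conn)
print(round(slope, 6))
```

The driver never pulls rows into Python; it only reads back one aggregate value per iteration, which is the property the proposal attributes to keeping computation close to the data. On a parallel engine the same aggregate would be evaluated across segments, but that part is outside what this single-process sketch can show.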
