+1 (non-binding)

On Thu, Sep 10, 2015 at 3:57 PM, Caleb Welton <cwel...@pivotal.io> wrote:
> +1 (non-binding)
>
> On Thu, Sep 10, 2015 at 12:53 PM, Rahul Iyer <rahulri...@gmail.com> wrote:
>
> > +1 (non-binding)
> >
> > On Wed, Sep 09, 2015 at 07:37PM, Roman Shaposhnik wrote:
> > >
> > > Following the discussion earlier:
> > > http://s.apache.org/TE6
> > >
> > > I would like to call a VOTE for accepting
> > > MADlib community as a new ASF incubator
> > > project.
> > >
> > > The proposal is available at:
> > > https://wiki.apache.org/incubator/MADlibProposal
> > > and is also included at the bottom of this email.
> > >
> > > Vote is open until at least Mon, 14 September 2015, 23:59:00 PST
> > >
> > > [ ] +1 accept MADlib into the Apache Incubator
> > > [ ] ±0
> > > [ ] -1 because...
> > >
> > > Thanks,
> > > Roman.
> > >
> > > == Abstract ==
> > > MADlib is an open-source library (licensed under 2-clause BSD license)
> > > for scalable in-database analytics. It provides data-parallel
> > > implementations of mathematical, statistical and machine learning
> > > methods for structured and unstructured data. The MADlib mission is to
> > > foster widespread development of scalable analytic skills, by
> > > harnessing efforts from commercial practice, academic research, and
> > > open source development.
> > >
> > > MADlib occupies a unique niche in the realm of data science and
> > > machine learning libraries since its SQL APIs can allow it to work on
> > > a wide range of data stores and SQL engines.
> > >
> > > == Proposal ==
> > > The current open source community behind MADlib feels that aligning
> > > itself with HAWQ's community, governance model, infrastructure and
> > > roadmap will allow the project to accelerate adoption and community
> > > growth. Given HAWQ's trajectory of entering Apache Software Foundation
> > > family as an Incubating project, we feel that the best course of
> > > action for MADlib is to follow a similar route.
> > >
> > > MADlib and HAWQ are complementary technologies in that MADlib
> > > in-database analytical functions can run within the HAWQ execution
> > > engine. (MADlib also runs on Greenplum Database and PostgreSQL today.)
> > > It is expected that contributors to MADlib will be cognizant of the
> > > HAWQ ASF project and may contribute to it as well. In short,
> > > collaboration between the two communities will make both projects more
> > > vibrant and advance the respective technologies in potentially novel
> > > directions.
> > >
> > > Contributors may also look at the HAWQ project as a starting point for
> > > ports to other parallel database engines. This proposal highly
> > > encourages this type of work as it would help to further realize the
> > > original cross-platform goal of MADlib as envisioned by its
> > > originators.
> > >
> > > Thus, the goal of this proposal is to bring the existing MADlib open
> > > source community into ASF, change the project's governance model to
> > > the "Apache Way" and transition the project's codebase and
> > > infrastructure into ASF INFRA. The community has agreed to transfer
> > > the brand name "MADlib" to Apache Software Foundation as well.
> > >
> > > Pivotal Inc. on behalf of the MADlib open source community is
> > > submitting this proposal to transition source code and associated
> > > artifacts (documentation, web site content, wiki, etc.) to the Apache
> > > Software Foundation Incubator under the Apache License, Version 2.0
> > > and is asking Incubator PMC to establish a MADlib incubating
> > > project.
> > >
> > > Currently MADlib uses a few category X licensed software tools during
> > > its build (mostly for generating documentation):
> > > * doxypy 0.4.2 (GPL)
> > > * doxygen 1.8.4 (GPL)
> > > * TikZ-UML
> > > * bison 2.4 (GPL, with an exception for generated output)
> > > We feel that this usage is compatible with an overall project licensed
> > > under the ALv2 and don't anticipate any changes.
> > > Our usage of LGPL library cern_root-5.34 is expected to go away since
> > > the 2 cern modules used are being entirely re-written in MADlib.
> > >
> > > Finally, MADlib inclusion of MPL licensed library (eigen 3.2.2) into
> > > its binary artifact seems to be consistent with
> > > ASF recommendation for managing "weak copyleft" dependencies.
> > >
> > >
> > > == Background ==
> > > MADlib grew out of discussions between database engine developers,
> > > data scientists, IT architects and academics interested in new
> > > approaches to scalable, sophisticated in-database analytics. These
> > > discussions were written up in a paper in VLDB 2009 that coined the
> > > term “MAD Skills” for data analysis
> > > (http://dl.acm.org/citation.cfm?id=1687576). The MADlib software
> > > project began the following year as a collaboration between
> > > researchers at UC Berkeley and engineers and data scientists at
> > > Pivotal (formerly EMC/Greenplum).
> > >
> > > The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the
> > > University of Wisconsin, and the University of Florida. The project
> > > was publicly documented in a paper at VLDB 2012
> > > (http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf). Today
> > > MADlib has contributors from around the world including both
> > > individuals and institutions. For example, recent contributions have
> > > come from Pivotal, Stanford University, and the University of Illinois
> > > at Chicago.
> > >
> > > MADlib was conceived from the outset as a free, open source library
> > > for all to use and contribute to. Since its inception, the community
> > > has steadily added new methods in the areas of mathematics,
> > > statistics, machine learning, and data transformation. The current
> > > library includes over 30 principal algorithms as well as many
> > > additional operators and utility functions.
> > >
> > > The methods in MADlib are designed both for in- or out-of-core
> > > execution, and for the shared-nothing, scale-out parallelism offered
> > > by modern parallel database engines, ensuring that computation is done
> > > close to the data. The core functionality is written in declarative
> > > SQL statements, which orchestrate data movement to and from disk, and
> > > across networked machines. Single-node inner loops take advantage of
> > > SQL extensibility to call out to high performance math libraries in
> > > user-defined scalar and aggregate functions. At the highest level,
> > > tasks that require iteration and/or structure definition are coded in
> > > Python driver routines, which are used only to kick off the data-rich
> > > computations that happen within the database engine.
> > >
> > > The first platforms supported by MADlib were Greenplum Database and
> > > PostgreSQL. With the development of HAWQ SQL-on-Hadoop technology by
> > > Pivotal, MADlib offers a way to perform predictive analytics on very
> > > large data sets stored on a Hadoop cluster.
> > >
> > > Today, MADlib is in active development and is deployed on a wide
> > > variety of industry and academic projects across many different
> > > verticals.
> > >
> > > == Rationale ==
> > > Enterprises today are seeing the value of landing very large
> > > quantities of data in Hadoop clusters with the goal of improving their
> > > products and processes. With the proliferation of increasingly
> > > sophisticated SQL-on-Hadoop technologies such as HAWQ, analysts can
> > > use the familiar SQL language to query this data at scale. This
> > > effectively opens the door to Hadoop in the enterprise.
> > >
> > > Adding SQL-based predictive analytics like MADlib to the equation
> > > enables organizations to reason across large data sets without
> > > resorting to sampling, which has been a traditional approach when
> > > confronted with scale problems.
> > > Operating on all of the data with MADlib results in more robust and
> > > accurate models.
> > >
> > > Since MADlib is a SQL-based interface, organizations do not need to
> > > re-train their teams on an unfamiliar programming language since SQL
> > > skills are ubiquitous in today's enterprises.
> > >
> > > Given the high velocity of innovation happening in the underlying
> > > Hadoop ecosystem, any SQL-based predictive analytics technology that
> > > plays in this ecosystem must be commensurately agile to keep up with
> > > the community. We strongly believe that in the Big Data space, this
> > > can be optimally achieved through a vibrant, diverse, self-governed
> > > community collectively innovating around a single codebase while at
> > > the same time cross-pollinating with various other data management
> > > communities. Apache Software Foundation is the ideal place to meet
> > > those ambitious goals.
> > >
> > > == Initial Goals ==
> > > Our initial goals are to bring MADlib into the ASF, transition the
> > > engineering and governance processes to be in accordance with the
> > > "Apache Way" and foster a collaborative development model closely
> > > aligned with that of HAWQ.
> > >
> > > Another important goal is encouraging efforts to port to other
> > > execution engines.
> > >
> > > The MADlib project will continue developing new functionality in an
> > > open, community-driven way. We envision accelerating innovation under
> > > ASF governance, in order to meet the requirements of a wide variety of
> > > predictive analytics use cases.
> > >
> > > We will also require transitioning of existing project infrastructure
> > > (source code, JIRA, mailing list) to the ASF infrastructure.
> > >
> > > == Current Status ==
> > > Currently, the project is available at http://madlib.net/. The
> > > codebase is licensed under a 2-clause BSD license. Our current
> > > governance model could be described as a "benevolent dictator" one.
> > > As stated above, the existing MADlib community feels that closer
> > > alignment with HAWQ community, infrastructure and the governance model
> > > as it is being proposed to ASF will allow MADlib project to thrive
> > > much more compared to relative isolation from HAWQ.
> > >
> > > === Meritocracy ===
> > > Our proposed list of initial committers includes the current MADlib R&D
> > > team at Pivotal and existing active members of the open source
> > > project. This group will form a base for the broader community we will
> > > invite to collaborate on the codebase. We intend to radically expand
> > > the initial developer and user community by running the project in
> > > accordance with the "Apache Way". Users and new contributors will be
> > > treated with respect and welcomed. By participating in the community
> > > and providing quality patches/support that move the project forward,
> > > they will earn merit. They also will be encouraged to provide non-code
> > > contributions (documentation, events, community management, etc.) and
> > > will gain merit for doing so. Those with a proven support and quality
> > > track record will be encouraged to become committers.
> > >
> > > === Community ===
> > > If MADlib is accepted for incubation, the primary initial goal will be
> > > transitioning the core community towards embracing the Apache Way of
> > > project governance. We would solicit major existing contributors to
> > > become committers on the project from the start.
> > >
> > > === Core Developers ===
> > > MADlib core developers are skilled in working as part of openly
> > > governed communities. That said, most of the core developers are
> > > currently NOT affiliated with the ASF and would require new ICLAs
> > > before committing to the project.
> > >
> > > === Alignment ===
> > > The following existing ASF projects can be considered when reviewing
> > > the MADlib proposal:
> > >
> > > Apache Mahout project's goal is to build an environment for quickly
> > > creating scalable performant machine learning applications. Apache
> > > Mahout is, perhaps, the oldest machine learning library in Hadoop
> > > ecosystem. The three major components of Mahout are an environment for
> > > building scalable algorithms, many new Scala + Spark (H2O in progress)
> > > algorithms, and Mahout's mature Hadoop MapReduce algorithms. We see
> > > the two projects benefiting from each other's experience of
> > > implementing similar classes of algorithms and look forward to a
> > > fruitful exchange of ideas between the two communities.
> > >
> > > Apache Spark is a fast engine for processing large datasets, typically
> > > from a Hadoop cluster, and performing batch, streaming, interactive,
> > > or machine learning workloads. Recently, Apache Spark has embraced
> > > SQL-like APIs around DataFrames at its core. Because of that we would
> > > expect a level of collaboration between the two projects. Spark
> > > project also contains a library (MLlib) that is the closest cousin to
> > > MADlib. MLlib is Apache Spark's scalable machine learning library. We
> > > see the two projects benefiting from each other's experience of
> > > implementing similar classes of algorithms and look forward to a
> > > fruitful exchange of ideas between the two communities.
> > >
> > > Apache Hive is a data warehouse software that facilitates querying and
> > > managing large datasets residing in distributed storage. Hive provides
> > > a mechanism to project structure onto this data and query the data
> > > using a SQL-like language called HiveQL. We see a potential for MADlib
> > > to leverage Hive as a backend the same way it currently leverages
> > > PostgreSQL-derived SQL backends.
> > > This could be especially useful for longer running algorithms.
> > >
> > > Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and
> > > Cloud Storage. We see a potential for MADlib to leverage Drill as a
> > > backend the same way it currently leverages PostgreSQL-derived SQL
> > > backends. This could be especially useful for analyzing data coming
> > > from heterogeneous sources and federated by the Drill engine.
> > >
> > > == Known Risks ==
> > > Development has been sponsored mostly by a single company (or its
> > > predecessors) thus far and coordinated mainly by the core Pivotal R&D
> > > team.
> > >
> > > So far, the project's governance model has explicitly been a
> > > "benevolent dictator" one. For the project to fully transition to the
> > > "Apache Way", development must shift towards the meritocracy-centric
> > > model of growing a community of contributors balanced with the needs
> > > for extreme stability and core implementation coherency.
> > >
> > > === Orphaned products ===
> > > The community proposing MADlib for incubation is an independent open
> > > source community. Even though Pivotal happens to be the biggest
> > > corporate sponsor of the project (by means of employing the core team)
> > > the community goes beyond those affiliated with Pivotal. On top of
> > > that, Pivotal is fully committed to maintain its position as one of
> > > the leading providers of SQL-based analytics aimed squarely at data
> > > scientists. MADlib is the only game in town that can leverage SQL APIs
> > > ranging from traditional RDBMS technology all the way to data
> > > warehousing (Pivotal Greenplum Database) and into SQL-on-Hadoop
> > > (HAWQ). Moreover, Pivotal has a vested interest in making MADlib
> > > succeed by driving its close integration with sister ASF projects. We
> > > expect this to further reduce the risk of orphaning the product.
> > >
> > > Even in the absence of support by a particular vendor such as Pivotal,
> > > and in a worst-case scenario where HAWQ and Greenplum Database fail to
> > > gain traction in OSS, the existence of an established PostgreSQL OSS
> > > project means there will still be a working stack for MADlib.
> > >
> > > === Inexperience with Open Source ===
> > > MADlib has been an open source project from the outset. All developers
> > > working on the project (regardless of their employment affiliation)
> > > did so completely in the open. While the governance model of MADlib
> > > has been more of a benevolent dictator model, the project has always
> > > been receptive to accepting contributions from all sources and
> > > including them in future releases based on thorough code review,
> > > testing, and compliance with the project’s coding best practices.
> > >
> > > === Homogeneous Developers ===
> > > While most of the initial committers are employed by Pivotal, there's
> > > still a healthy level of interest coming from academia. On top of that
> > > we expect to spark curiosity in sister ASF projects and attract
> > > developers unaffiliated with Pivotal. Finally, MADlib is being used
> > > extensively whenever Pivotal engages with customers on data science
> > > projects. This typically means that the skills remain within a
> > > customer organization which further increases the chance of turning
> > > customer data scientists into MADlib contributors.
> > >
> > > === Reliance on Salaried Developers ===
> > > A large percentage of the contributors are paid to work in the Big
> > > Data space. While they might wander from their current employers, they
> > > are unlikely to venture far from their core expertise and thus will
> > > continue to be engaged with the project regardless of their current
> > > employers.
> > > In addition, the project is still enjoying popularity in
> > > academic circles and we hope that will help mitigate reliance on
> > > salaried developers as well.
> > >
> > > === Relationships with Other Apache Products ===
> > > As mentioned in the Alignment section, MADlib may consider various
> > > degrees of integration and code exchange with Apache Spark (MLlib),
> > > Apache Mahout, Apache Hive and Apache Drill projects. We expect
> > > integration points to be inside and outside the project. We look
> > > forward to collaborating with these communities as well as other
> > > communities under the Apache umbrella.
> > >
> > > === An Excessive Fascination with the Apache Brand ===
> > > While we intend to leverage the Apache "brand" when talking to other
> > > projects as a testament to our project’s neutrality, we have no plans
> > > for making use of the Apache brand in press releases nor posting
> > > billboards advertising acceptance of MADlib into Apache Incubator.
> > >
> > > == Documentation ==
> > > The documentation is currently available at:
> > > https://github.com/madlib/frontpage
> > >
> > > The documentation is currently licensed under 2-clause BSD license.
> > >
> > > == Initial Source ==
> > > Initial source code is available at:
> > > * MADlib: https://github.com/madlib/madlib
> > > * Testsuite: https://github.com/madlib/testsuite
> > > * Contributors: https://github.com/madlib/contrib
> > >
> > > The code is currently licensed under 2-clause BSD license.
> > >
> > > == Source and Intellectual Property Submission Plan ==
> > > As soon as MADlib is approved to join the Incubator, the source code
> > > will be transitioned via the Software Grant Agreement onto ASF
> > > infrastructure and in turn made available under the Apache License,
> > > version 2.0. We know of no legal encumbrances that would inhibit the
> > > transfer of source code to the ASF.
> > >
> > > == External Dependencies ==
> > >
> > > Runtime dependencies:
> > > * boost-1.47.0 (Boost Software License)
> > > * _m_widen_init (MIT for this subcomponent of GCC)
> > > * python-argparse-1.2.1 (PSF LICENSE AGREEMENT FOR PYTHON 2.7.1)
> > > * pyyaml-3.10 (MIT license)
> > > * cern_root-5.34 (LGPL, however this dependency will be removed
> > > since the 2 cern modules used are being entirely re-written in MADlib)
> > > * eigen-3.2.2 (Mozilla Public License)
> > > * pyxb-1.2.4 (Apache license version 2)
> > > * python (Python Software Foundation License Version 2)
> > > * mathjax-2.5 (Apache license version 2)
> > >
> > > Build only dependencies:
> > > * doxypy-0.4.2 (GPL)
> > > * cmake-2.8.4 (BSD 3-clause License)
> > > * doxygen >= 1.8.4 (GPL)
> > > * flex >= 2.5.33 (BSD)
> > > * bison >= 2.4 (GPL)
> > > * latex (LaTeX Project Public License)
> > > * TikZ-UML (no license information)
> > >
> > > Cryptography
> > > * N/A
> > >
> > > == Required Resources ==
> > >
> > > === Mailing lists ===
> > > * priv...@madlib.incubator.apache.org (moderated subscriptions)
> > > * comm...@madlib.incubator.apache.org
> > > * d...@madlib.incubator.apache.org
> > > * iss...@madlib.incubator.apache.org
> > > * u...@madlib.incubator.apache.org
> > >
> > > === Git Repository ===
> > > https://git-wip-us.apache.org/repos/asf/incubator-madlib.git
> > >
> > > === Issue Tracking ===
> > > JIRA Project MADlib (MADLIB)
> > >
> > > We will also request migration of our current JIRA available at
> > > http://jira.madlib.net/
> > >
> > > === Other Resources ===
> > >
> > > Means of setting up regular builds for MADlib on builds.apache.org
> > > will require integration with Docker support.
> > >
> > > == Initial Committers ==
> > > * Anirudh Kondaveeti
> > > * Caleb Welton
> > > * Frank McQuillan
> > > * Gang Xiong
> > > * Gautam Muralidhar
> > > * Hitoshi Harada
> > > * Hulya Emir-farinas
> > > * Ian Huston
> > > * KeeSiong Ng
> > > * Noel Sio
> > > * Rahul Iyer
> > > * Rashmi Raghu
> > > * Regunathan Radhakrishnan
> > > * Ronert Obst
> > > * Samuel Ziegler
> > > * Sarah Aerni
> > > * Srivatsan Ramanujam
> > > * Woo Jae Jung
> > > * Xixuan Feng
> > > * Yu Yang
> > > * Atri Sharma
> > > * Greg Chase
> > > * Chloe Jackson
> > > * Roman Shaposhnik
> > > * Vaibhav Gumashta
> > > * Ted Dunning
> > > * Konstantin Boudnik
> > >
> > > == Affiliations ==
> > > * Hortonworks: Vaibhav Gumashta
> > > * MapR: Ted Dunning
> > > * WANDisco: Konstantin Boudnik
> > > * Barclays: Atri Sharma
> > > * Pivotal: everyone else on this proposal
> > >
> > > == Sponsors ==
> > >
> > > === Champion ===
> > > Roman Shaposhnik
> > >
> > > === Nominated Mentors ===
> > >
> > > The initial mentors are listed below:
> > > * Ted Dunning - Apache Member, MapR
> > > * Konstantin Boudnik - Apache Member, WANDisco
> > > * Roman Shaposhnik - Apache Member, Pivotal
> > >
> > > === Sponsoring Entity ===
> > > We would like to propose Apache incubator to sponsor this project.
> > >