It's really nice work. +1 (non-binding)
On Thu, Jan 2, 2014 at 12:27 PM, Imesh Gunaratne <im...@apache.org> wrote: > +1 (non-binding) > > > On Wed, Jan 1, 2014 at 2:09 AM, Jakob Homan <jgho...@gmail.com> wrote: > >> Incubator- >> >> Following the discussion earlier, I'm calling a vote to accept DataFu as a >> new Incubator project. >> >> The proposal draft is available at: >> https://wiki.apache.org/incubator/DataFuProposal, and is also included >> below. >> >> Vote is open for at least 96h and closes at the earliest on 4 Jan 13:00 >> PDT. I'm letting the vote run an extra day as we're in the holiday season. >> >> [ ] +1 accept DataFu in the Incubator >> [ ] +/-0 >> [ ] -1 because... >> >> Here's my binding +1. >> -Jakob >> >> ------------------------------- >> Abstract >> >> DataFu makes it easier to solve data problems using Hadoop and higher level >> languages based on it. >> >> Proposal >> >> DataFu provides a collection of Hadoop MapReduce jobs and functions in >> higher level languages based on it to perform data analysis. It provides >> functions for common statistics tasks (e.g. quantiles, sampling), PageRank, >> stream sessionization, and set and bag operations. DataFu also provides >> Hadoop jobs for incremental data processing in MapReduce. >> >> Background >> >> DataFu began two years ago as set of UDFs developed internally at LinkedIn, >> coming from our desire to solve common problems with reusable components. >> Recognizing that the community could benefit from such a library, we added >> documentation, an extensive suite of unit tests, and open sourced the code. >> Since then there have been steady contributions to DataFu as we encountered >> common problems not yet solved by it. Others outside LinkedIn have >> contributed as well. More recently we recognized the challenges with >> efficient incremental processing of data in Hadoop and have contributed a >> set of Hadoop MapReduce jobs as a solution. >> >> DataFu began as a project at LinkedIn, but it has shown itself to be useful >> to other organizations and developers as well as they have faced similar >> problems. We would like to share DataFu with the ASF and begin developing a >> community of developers and users within Apache. >> >> Rationale >> >> There is a strong need for well tested libraries that help developers solve >> common data problems in Hadoop and higher level languages such as Pig, >> Hive, Crunch, Scalding, etc. >> >> Current Status >> >> Meritocracy >> >> Our intent with this incubator proposal is to start building a diverse >> developer community around DataFu following the Apache meritocracy model. >> Since DataFu was initially open sourced in 2011, it has received >> contributions from both within and outside LinkedIn. We plan to continue >> support for new contributors and work with those who contribute >> significantly to the project to make them committers. >> >> Community >> >> DataFu has been building a community of developers for two years. It began >> with contributors from LinkedIn and has received contributions from >> developers at Cloudera since very early on. It has been included included >> in Cloudera’s Hadoop Distribution and Apache Bigtop. We hope to extend our >> contributor base significantly and invite all those who are interested in >> solving large-scale data processing problems to participate. >> >> Core Developers >> >> DataFu has a strong base of developers at LinkedIn. Matthew Hayes initiated >> the project in 2011, and aside from continued contributions to DataFu has >> also contributed the sub-project Hourglass for incremental MapReduce >> processing. Separate from DataFu he has also open sourced the White >> Elephant project. Sam Shah contributed a significant portion of the >> original code and continues to contribute to the project. William Vaughan >> has been contributing regularly to DataFu for the past two years. Evion Kim >> has been contributing to DataFu for the past year. Xiangrui Meng recently >> contributed implementations of scalable sampling algorithms based on >> research from a paper he published. Chris Lloyd has provided some important >> bug fixes and unit tests. Mitul Tiwari has also contributed to DataFu. >> Mathieu Bastian has been developing MapReduce jobs that we hope to include >> in DataFu. In addition he also leads the open source Gephi project. >> >> Alignment >> >> The ASF is the natural choice to host the DataFu project as its goal of >> encouraging community-driven open-source projects fits with our vision for >> DataFu. Additionally, other projects DataFu integrates with, such as Apache >> Pig and Apache Hadoop, and in the future Apache Hive and Apache Crunch, are >> hosted by the ASF and we will benefit and provide benefit by close >> proximity to them. >> >> Known Risks >> >> Orphaned Products >> >> The core developers have been contributing to DataFu for the past two >> years. There is very little risk of DataFu being abandoned given its >> widespread use within LinkedIn. >> >> Inexperience with Open Source >> >> DataFu was started as an open source project in 2011 and has remained so >> for two years. Matt initiated the project, and additionally is the creator >> of the open source White Elephant project. He has also contributed patches >> to Apache Pig. Most recently he has released Hourglass as a sub-project of >> DataFu. Sam contributed much of the original code and continues to >> contribute to the project. Will has been contributing to DataFu since it >> was first open sourced. Evion has been contributing for the past year. >> Mathieu leads the open source Gephi project. Jakob has been actively >> involved with the ASF as a full-time Hadoop committer and PMC member. >> >> Homogeneous Developers >> >> The current core developers are all from LinkedIn. DataFu has also received >> contributions from other corporations such as Cloudera. Two of these >> developers are among the Initial Committers listed below. We hope to >> establish a developer community that includes contributors from several >> other corporations and we are actively encouraging new contributors via >> presentations and blog posts. >> >> Reliance on Salaried Developers >> >> The current core developers are salaried employees of LinkedIn, however >> they are not paid specifically to work on DataFu. Contributions to DataFu >> arise from the developers solving problems they encounter in their various >> projects. The purpose of DataFu is to share these solutions so that others >> may benefit and build a community of developers striving to solve common >> problems together. Furthermore, once the project has a community built >> around it, we expect to get committers, developers and contributions from >> outside the current core developers. >> >> Relationships with Other Apache Products >> >> DataFu is deeply integrated with Apache products. It began as a library of >> user-defined functions for Apache Pig. It has grown to also include Hadoop >> jobs for incremental data processing and in the future will include code >> for other higher level languages built on top of Apache Hadoop. >> >> An Excessive Obsession with the Apache Brand >> >> While we respect the reputation of the Apache brand and have no doubts that >> it will attract contributors and users, our interest is primarily to give >> DataFu a solid home as an open source project following an established >> development model. >> >> Documentation >> >> Information on DataFu can be found at: >> >> https://github.com/LinkedIn/DataFu/blob/master/README.md >> >> Initial Source >> >> The initial source is available at: >> >> https://github.com/LinkedIn/DataFu >> >> Source and Intellectual Property Submission Plan >> >> The DataFu library source code, available on GitHub. >> >> External Dependencies >> >> The initial source has the following external dependencies that are either >> included in the final DataFu library or required in order to use it: >> >> fastutil (Apache 2.0) >> joda-time (Apache 2.0) >> commons-math (Apache 2.0) >> guava (Apache 2.0) >> stream (Apache 2.0) >> jsr-305 (BSD) >> log4j (Apache 2.0) >> json (The JSON License) >> avro (Apache 2.0) >> >> In addition, the following external libraries are used either in building, >> developing, or testing the project: >> >> pig (Apache 2.0) >> hadoop (Apache 2.0) >> jline (BSD) >> antlr (BSD) >> commons-io (Apache 2.0) >> testng (Apache 2.0) >> maven (Apache 2.0) >> jsr-311 (CDDL-1.0) >> slf4j (MIT) >> eclipse (Eclipse Public License 1.0) >> autojar (GPLv2) >> jarjar (Apache 2.0) >> >> Cryptography >> >> DataFu has user-defined functions that use MD5 and SHA provided by Java’s >> java.security.MessageDigest. >> >> Required Resources >> >> Mailing Lists >> >> DataFu-private for private PMC discussions (with moderated subscriptions) >> DataFu-dev DataFu-commits >> >> Subversion Directory >> >> Git is the preferred source control system: git://git.apache.org/DataFu >> >> Issue Tracking >> >> JIRA DataFu (DataFu) >> >> Other Resources >> >> The existing code already has unit tests, so we would like a Hudson >> instance to run them whenever a new patch is submitted. This can be added >> after project creation. >> >> Initial Committers >> >> Matthew Hayes >> William Vaughan >> Evion Kim >> Sam Shah >> Xiangrui Meng >> Christopher Lloyd >> Mathieu Bastian >> Mitul Tiwari >> Josh Wills >> Jarek Jarcec Cecho >> >> Affiliations >> >> Matthew Hayes (LinkedIn) >> >> William Vaughan (LinkedIn) >> >> Evion Kim (LinkedIn) >> >> Sam Shah (LinkedIn) >> >> Xiangrui Meng (LinkedIn) >> >> Christopher Lloyd (LinkedIn) >> >> Mathieu Bastian (LinkedIn) >> >> Mitul Tiwari (LinkedIn) >> Josh Wills (Cloudera) >> Jarek Jarcec Cecho (Cloudera) >> >> Sponsors >> >> Champion >> >> Jakob Homan (Apache Member) >> >> Nominated Mentors >> >> Ashutosh Chauhan <hashutosh at apache dot org> >> >> Roman Shaposhnik <rvs at apache dot org> >> >> Ted Dunning <tdunning at apache dot org> >> >> Sponsoring Entity >> >> We are requesting the Incubator to sponsor this project. >> --------------------------------------------------------------------- To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org