Byung-Gon, it looks like a good proposal. There are some minor edits I'd recommend you make:
- Use the same github URL consistently.
- I just fixed the section of the proposal guide to include how to reference a git repository. This should help you and future proposed podlings get things going better.

If you like - you already have 3 mentors, I'd be willing to step up and help mentor REEF as well.

John

On Fri, Aug 1, 2014 at 3:14 AM, Byung-Gon Chun <bgc...@gmail.com> wrote:
> Hi everyone,
>
> I would like to propose REEF to be an Apache Incubator project. REEF is a
> scale-out computing fabric that eases the development of Big Data
> applications on top of resource managers such as Apache YARN and Mesos.
>
> The proposal is included in plain text below. I would also like to put this
> on wiki but I don't have privileges to create wiki pages.
>
> I look forward to hearing everyone's thoughts and feedback!
>
> -Gon
>
> --
> Byung-Gon Chun
>
>
> ===
>
> # REEFProposal - Incubator
>
>
> # Abstract
>
> REEF (Retainable Evaluator Execution Framework) is a scale-out
> computing fabric that eases the development of Big Data applications
> on top of resource managers such as Apache YARN and Mesos.
>
>
> # Proposal
>
> REEF is a Big Data system that makes it easy to implement scalable,
> fault-tolerant runtime environments for a range of data processing
> models (e.g., graph processing and machine learning) on top of
> resource managers such as Apache YARN and Mesos. REEF provides
> capabilities to run multiple heterogeneous frameworks and workflows of
> those efficiently.
>
> Additionally, REEF contains two libraries that are of independent
> value: Wake is an event-based-programming framework inspired by Rx and
> SEDA. Tang is a dependency injection framework inspired by Google
> Guice, but designed specifically for configuring distributed systems.
>
>
> # Background
>
> The resource management layer such as Apache YARN and Mesos has
> emerged as a critical layer in the new scale-out data processing
> stack; resource managers assume the responsibility of multiplexing a
> cluster of shared-nothing machines across heterogeneous
> applications. They operate behind an interface for leasing containers
> - a slice of a machine’s resources - to computations in an elastic
> fashion. However, building data processing frameworks directly on this
> layer comes at a high cost: each framework must tackle the same
> challenges (e.g., fault-tolerance, task scheduling and coordination)
> and reimplement common mechanisms (e.g., caching, bulk transfers).
>
> REEF provides a reusable control-plane for scheduling and coordinating
> task-level work on cluster resource managers. The REEF design enables
> sophisticated optimizations, such as container re-use and data
> caching, and facilitates workflows that span multiple
> frameworks. Examples include pipelining data between different
> operators in a relational system, retaining state across iterations in
> iterative or recursive data flow, and passing the result of a
> MapReduce job to a Machine Learning computation.
>
>
> # Rationale
>
> Since REEF is a library that makes it easy to write distributed
> applications on top of Apache YARN or Mesos, the Apache Software Foundation
> is the perfect home for hosting REEF.
>
>
> # Current Status
>
> REEF has been developed mostly by Microsoft, UCLA and the Seoul
> National University. The REEF codebase is open-sourced under Apache
> License 2.0 and is currently hosted in a public repository at
> github.com.
>
>
> # Meritocracy
>
> We plan to build a strong open community by following the Apache
> meritocracy principles. We will work with those who contribute
> significantly to the project and invite them to be its committers.
>
>
> # Community
>
> REEF is currently being used internally at Microsoft. Also, SK
> Telecom builds their data analytics infrastructure on top of REEF in
> collaboration with Seoul National University. We hope to extend our
> contributor base by becoming an Apache incubator project. REEF will
> attract developers who are interested in creating common building
> blocks for simplifying the development of large-scale big data
> applications.
>
>
> # Core Developers
>
> Core developers are engineers from Microsoft, Purestorage, UCB, UCLA,
> UW and Seoul National University.
>
>
> # Alignment
>
> REEF depends on many Apache projects and dependencies. REEF is built
> on resource managers such as Apache YARN and Apache Mesos. REEF also
> uses HDFS as a distributed storage layer.
>
>
> # Known Risks
> ## Orphaned Products
>
> The risk of REEF being orphaned is small because Microsoft products
> are built on REEF. The core REEF developers continue to work on REEF
> at Microsoft, UCLA, and Seoul National University. The REEF project is
> gaining interest from other institutions to be used as their
> infrastructure.
>
> ## Inexperience with Open Source
>
> Several core developers have experience with open source development.
> REEF committers will be guided by the mentors with strong Apache open
> source project backgrounds.
>
> ## Homogeneous Developers
>
> The initial committers include developers from several institutions
> including Microsoft, Purestorage, UCB, UCLA, and Seoul National
> University.
>
> ## Reliance on Salaried Developers
>
> Developers from Microsoft are paid to work on REEF. Since the work is
> used internally at Microsoft, Microsoft will keep supporting the
> developers to work on REEF. There are also engineers and graduate
> students that contribute to REEF from UCLA, UCB, UW and Seoul National
> University. We plan to attract active developers from other
> institutions.
>
> ## Relationships with Other Apache Products
>
> Given REEF's position in the big data stack, there are three
> relationships to consider: Projects that fit below, on top of, or
> alongside REEF in the stack.
>
> ### Below REEF: Mesos and YARN
>
> REEF is designed to facilitate application development on top of
> resource managers. Hence, its relationship with the aforementioned
> resource managers is symbiotic by design.
>
> ### On Top of REEF
>
> Apache Spark, Giraph, MapReduce and Flink are only some of the
> projects that logically belong at a higher layer of the big data stack
> than REEF. Of course, none of these today actually are leveraging
> REEF and had to each individually solve some of the issues REEF
> addresses. It is our goal that REEF will help developers create
> an even richer set of future big data frameworks.
>
> ### Alongside REEF
>
> Apache hosts several projects building intermediate, library layers on
> top of a resource management platform. Twill, Slider, and Tez are
> notable examples in the incubator. These projects share many
> objectives with REEF (and each other). We expect these parallel
> explorations to converge and differentiate within Apache, as the space
> for distributed applications and deployment is too vast for a single
> answer.
>
> Apache Twill and REEF both aim to simplify application development on
> top of resource managers. However, REEF and Twill go about this in
> different ways: Twill simplifies programming by exposing a programming
> model, Java Threads. REEF on the other hand provides a set of common
> building blocks (e.g., job coordination, state passing, cluster
> membership) for building big data processing applications and
> virtualizes underlying resource managers. None of this prescribes a
> specific programming model. As such, REEF occupies a slot ever so
> slightly below Twill in an architecture stack.
>
> Apache Slider is a framework to make it easy to deploy and manage
> long-running static applications in a YARN cluster. The focus is to
> adapt existing applications such as HBase and Accumulo to run on YARN
> with little modification. Therefore, the goals of Slider and REEF are
> different.
>
> Apache Tez is a project to develop a generic Directed Acyclic Graph (DAG)
> processing framework with a reusable set of data processing primitives.
> The initial focus is to provide improved data processing capabilities for
> projects like Apache Hive, Apache Pig, and Cascading. Tez is still a single
> framework for DAG processing. In contrast, REEF provides a generic
> layer on which diverse computation models (DAG, ML, Graph processing,
> and Interactive query processing) can be built. More importantly,
> REEF provides a layer that facilitates inter-framework resource and
> in-memory state use and virtualizes resource managers. Regarding
> re-usable data processing primitives, Tez and REEF share the same
> goal. We hope to collaborate on features which can be shared between
> Tez and REEF.
>
>
> ## An Excessive Fascination with the Apache Brand
>
> The Apache Software Foundation has a reputation of being the best place to
> host open source projects. We believe that we will attract many developers
> who want to contribute to innovating in the Big Data platform space by
> joining the Apache Software Foundation.
>
>
> # Documentation
>
> The current documentation for REEF is at
> https://github.com/Microsoft-CISL/REEF as well as on
> http://www.reef-project.org
>
>
> # Initial Source
>
> The REEF codebase is currently hosted at
> https://github.com/Microsoft-CISL/REEF.
>
>
> # External Dependencies
>
> REEF makes extensive use of the vast array of Java libraries from the
> Apache Software Foundation, namely:
>
> * avro (Apache 2.0)
> * hadoop (Apache 2.0)
> * hdfs (Apache 2.0)
> * yarn (Apache 2.0)
> * commons-cli (Apache 2.0)
> * commons-configuration (Apache 2.0)
> * commons-lang (Apache 2.0)
> * commons-logging (Apache 2.0)
>
> To the best of our knowledge, the external dependencies of REEF are
> distributed under Apache compatible licenses:
>
> * guava-libraries (Apache 2.0)
> * protobuf (BSD)
> * asm (BSD)
> * netty (Apache 2.0)
> * mockito (MIT)
> * junit (EPL 1.0)
> * slf4j (MIT)
>
>
> # Cryptography
>
> REEF will depend on secure Hadoop, which can optionally use Kerberos.
>
> # Required Resources
>
> ## Mailing Lists
>
> * reef-private for private PMC discussions
> * reef-dev for technical discussions among contributors and
>   notification about commits
>
> ## Subversion Directory
>
> The REEF team uses Git for source version control:
> git://git.apache.org/reef
>
> ## Issue Tracking
>
> JIRA REEF (REEF)
>
> ## Other Resources
>
> Jenkins continuous integration testing
>
> # Initial Committers
>
> * Markus Weimer
> * Sergiy Matusevych
> * Julia Wang
> * Shravan M Narayanamurthy
> * Yingda Chen
> * Tony Majestro
> * Beysim Sezgin
> * Boris Shulman
> * Russell Sears
> * Jung Ryong Lee
> * You Sun Jung
> * Dong Joon Hyun
> * Josh Rosen
> * Tyson Condie
> * Brandon Myers
> * Yunseong Lee
> * Taegeon Um
> * Youngseok Yang
> * Brian Cho
> * Byung-Gon Chun
>
> # Affiliations
>
> * Microsoft:
>   * Markus Weimer
>   * Sergiy Matusevych
>   * Julia Wang
>   * Shravan M Narayanamurthy
>   * Yingda Chen
>   * Tony Majestro
>   * Beysim Sezgin
>   * Boris Shulman
> * Purestorage:
>   * Russell Sears
> * SK Telecom:
>   * Jung Ryong Lee
>   * You Sun Jung
>   * Dong Joon Hyun
> * University of California:
>   * Josh Rosen (Berkeley)
>   * Tyson Condie (LA)
> * University of Washington:
>   * Brandon Myers
> * Seoul National University:
>   * Yunseong Lee
>   * Taegeon Um
>   * Youngseok Yang
>   * Brian Cho
>   * Byung-Gon Chun
>
>
> # Sponsors
>
> ## Champions
> Chris Douglas <cdoug...@apache.org>
>
> ## Nominated Mentors
> * Chris Mattmann <mattm...@apache.org>
> * Ross Gardler <rgard...@apache.org>
> * Owen O'Malley <omal...@apache.org>
>
> ## Sponsoring Entity
> The Apache Incubator
>