+1 (non-binding)

On Thu, Feb 28, 2013 at 11:41 PM, Hyunsik Choi <hyun...@apache.org> wrote:
> Hi Folks,
>
> I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
> The vote will close on Mar 7 at 6:00 PM (PST).
>
> [] +1 Accept Tajo into the Apache incubator
> [] +0 Don't care.
> [] -1 Don't accept Tajo into the incubator because...
>
> The full proposal is pasted at the bottom of this email, and the
> corresponding wiki is http://wiki.apache.org/incubator/TajoProposal.
>
> Only VOTEs from Incubator PMC members are binding, but all are welcome to
> express their thoughts.
>
> Thanks,
> Hyunsik
>
> PS: From the initial discussion, the main changes are that I've added 4 new
> committers. Also, I've revised some description of Known Risks because the
> initial committers have become more diverse.
>
> ----------------
> Tajo Proposal
>
> = Abstract =
>
> Tajo is a distributed data warehouse system for Hadoop.
>
>
> = Proposal =
>
> Tajo is a relational and distributed data warehouse system for Hadoop. Tajo
> is designed for low-latency and scalable ad-hoc queries, online aggregation
> and ETL on large data sets by leveraging advanced database techniques. It
> supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel,
> Scope, and parallel databases. Tajo uses HDFS as a primary storage layer,
> and it has its own query engine which allows direct control of distributed
> execution and data flow. As a result, Tajo has a variety of query
> evaluation strategies and more optimization opportunities. In addition,
> Tajo will have a native columnar execution engine and its own optimizer.
> Tajo will be an alternative choice to Hive/Pig on top of MapReduce.
>
>
> = Background =
>
> Big data analysis has gained much attention in industry. Open source
> communities have proposed scalable and distributed solutions for ad-hoc
> queries on big data. However, there is still room for improvement. Markets
> need faster and more efficient solutions. Recently, some alternatives
> (e.g., Cloudera's Impala and Amazon Redshift) have come out.
>
>
> = Rationale =
>
> There are a variety of open source distributed execution engines (e.g.,
> Hive and Pig) running on top of MapReduce. They are limited by the MR
> framework. They cannot directly control distributed execution and data
> flow, and they just use the MR framework. So, they have limited query
> evaluation strategies and optimization opportunities. It is hard for them
> to be optimized for a certain type of data processing.
>
>
> = Initial Goals =
>
> The initial goal is to write more documents to describe Tajo's internals.
> It will be helpful to recruit more committers and to build a solid
> community. Then, we will make milestones for short/long term plans.
>
>
> = Current Status =
>
> Tajo is in the alpha stage. Users can execute usual SQL queries (e.g.,
> selection, projection, group-by, join, union and sort) except for nested
> queries. Tajo provides various row/column storage formats, such as CSV,
> RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
> also has a rudimentary ETL feature to transform one data format to another
> data format. In addition, Tajo provides hash and range repartitions. By
> using both repartition methods, Tajo processes aggregation, join, and sort
> queries over a number of cluster nodes. To evaluate the performance, we
> have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.
>
>
> == Meritocracy ==
>
> We will discuss the milestone and the future plan in an open forum. We plan
> to encourage an environment that supports a meritocracy. The contributors
> will have different privileges according to their contributions.
>
>
> == Community ==
>
> Big data analysis has gained attention from open source communities,
> industry, and academia. Some projects related to Hadoop already have
> very large and active communities. We expect that Tajo also will establish
> an active community.
> Since Tajo already works for some features and is in
> the alpha stage, it will attract a large community soon.
>
>
> == Core Developers ==
>
> Core developers are a diverse group of developers, many of whom are very
> experienced in open source and the Apache Hadoop ecosystem.
>
> * Eli Reisman <ereisman AT apache DOT org>
>
> * Henry Saputra <hsaputra AT apache DOT org>
>
> * Hyunsik Choi <hyunsik AT apache DOT org>
>
> * Jae Hwa Jung <jhjung AT gruter DOT com>
>
> * Jihoon Son <ghoonson AT gmail DOT com>
>
> * Jin Ho Kim <jhkim AT gruter DOT com>
>
> * Roshan Sumbaly <rsumbaly AT gmail DOT com>
>
> * Sangwook Kim <swkim AT inervit DOT com>
>
> * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
>
>
> == Alignment ==
>
> Tajo employs Apache Hadoop YARN as a resource management platform for large
> clusters. It uses HDFS as a primary storage layer. It already supports
> Hadoop-related data formats (RCFile, Trevni) and will support ORC files. In
> addition, we have a plan to integrate Tajo with other products of the
> Hadoop ecosystem. Tajo's modules are well organized, and these modules can
> also be used for other projects.
>
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Most of the code has been developed by only two core developers, Hyunsik
> Choi and Jihoon Son. This may pose a risk of being orphaned. However,
> they are guaranteed to have enough time to develop Tajo for years. As you
> can see from the commit history, they have participated in this project for
> about two years. In addition, the initial committers are diverse, and Tajo
> has been supported by two IT companies in South Korea. So, the risk of
> being orphaned is very low. Later, we will be eager to recruit additional
> committers in order to eliminate this risk.
>
>
> == Inexperience with Open Source ==
>
> Most of the initial committers have experience working on open source
> projects. In particular, Eli, Henry, and Hyunsik have experience as
> committers and PMC members on other Apache projects.
>
>
> == Homogeneous Developers ==
>
> Although they are a diverse group of developers, the fact that half of the
> core developers are in South Korea may be a risk. This is because their
> offline activities are limited due to their location. Since we recognize
> this risk, we will write more complete documents and presentation materials
> in order to disseminate Tajo's internals and user guide. In addition, to
> mitigate this risk we will be eager to recruit additional committers around
> the world.
>
>
> == Reliance on Salaried Developers ==
>
> It is expected that Tajo development will occur on both salaried time and
> on volunteer time. Hyunsik and Jihoon belong to the Database Lab., Korea
> Univ. They will be paid by the lab to contribute to Tajo for years. Jin Ho
> and Sangwook are paid by their employer to contribute to this project. Other
> developers will contribute to this project on volunteer time. In addition,
> we will be eager to recruit additional committers including salaried and
> non-salaried developers.
>
>
> == Relationships with Other Apache Products ==
>
> Tajo has some overlapping functions with Apache Incubator Drill. However,
> Tajo is more mature than Drill. In addition, there are some
> significant differences. Drill is a distributed system specialized for
> low-latency query processing by using column operations and intermediate
> data streaming. Drill has a very simple query optimizer. However, some
> queries including big-big table join and sort are not available in that
> manner. Drill will support some of the query types.
>
> In contrast, Tajo has an advanced query optimization system. Tajo mainly
> aims at scalable and efficient processing on all query types. By using the
> query optimizer, Tajo will only pursue low-latency query processing for
> some query types that can be executed in an online aggregation manner.
>
> Besides, Tez has some overlapping functions with Tajo. However, Tez is in
> the pre-alpha stage and may be a prototype.
> When Tez becomes feasible, Tajo
> could use Tez as an underlying framework according to the applicability.
> However, Tajo will still use its row/native columnar execution engine and
> its optimizer. Tajo may potentially be the first application of Tez.
>
>
> == An Excessive Fascination with the Apache Brand ==
>
> We believe that the Apache brand will help us to find contributors and to
> grow the community. The community and development process will make this
> project more stable and help establish ubiquitous APIs. In addition, Tajo
> depends on other projects in the Apache Hadoop ecosystem. We expect that
> cooperative work will occur with other projects in the same place.
>
>
> = Documentation =
>
> Tajo's demonstration paper was accepted to IEEE ICDE 2013. Since this
> conference will be held in April 2013, we cannot publicly show the paper.
> Instead, we have attached some presentation material. Check out these slides
> (http://www.slideshare.net/hyunsikchoi/tajo-intro).
>
> In addition, some documents (e.g., getting started) are available at
> http://tajo-project.github.com/tajo/.
>
>
> = Initial Source =
>
> The initial source code has been developed in the Database Lab., Korea Univ.
> It is implemented in Java and has almost 100,000 lines, excluding parser
> and protobuf generated code. The initial source code is already
> available on GitHub at [[https://github.com/tajo-project/tajo]].
>
>
> = Source and Intellectual Property Submission Plan =
>
> We intend the entire code base to be licensed under the Apache License,
> Version 2.0.
>
>
> = External Dependencies =
>
> All required dependencies have Apache-compatible licenses. The following
> components with non-Apache licenses are enumerated:
>
> * Google Guava
>
> * Google Protocol Buffer
>
> * Antlr
>
> * Mockito
>
> * JLine2
>
>
> = Cryptography =
>
> Tajo will depend on secure Hadoop that can optionally use Kerberos.
>
>
> = Required Resources =
>
> == Mailing Lists ==
>
> * tajo-private (with moderated subscriptions)
>
> * tajo-dev
>
> * tajo-commits
>
>
> == Subversion Directory ==
>
> https://git-wip-us.apache.org/repos/asf/tajo.git
>
>
> == Issue Tracking ==
>
> Jira Tajo (TAJO)
>
>
> == Other Resources ==
>
> * Continuous Integration
>
>   * Jenkins
>
> * Wiki
>
>   * http://wiki.apache.org/tajo
>
>
> = Initial Committers =
>
> * Eli Reisman <ereisman AT apache DOT org>
>
> * Henry Saputra <hsaputra AT apache DOT org>
>
> * Hyunsik Choi <hyunsik AT apache DOT org>
>
> * Jae Hwa Jung <jhjung AT gruter DOT com>
>
> * Jihoon Son <ghoonson AT gmail DOT com>
>
> * Jin Ho Kim <jhkim AT gruter DOT com>
>
> * Roshan Sumbaly <rsumbaly AT gmail DOT com>
>
> * Sangwook Kim <swkim AT inervit DOT com>
>
> * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
>
>
> = Affiliations =
>
> * Eli Reisman (Hortonworks)
>
> * Henry Saputra (Platfora)
>
> * Hyunsik Choi (Database Lab., Korea University)
>
> * Jae Hwa Jung (Gruter)
>
> * Jihoon Son (Database Lab., Korea University)
>
> * Jin Ho Kim (Gruter)
>
> * Roshan Sumbaly (LinkedIn)
>
> * Sangwook Kim (Inervit)
>
> * Yi A Liu (Intel)
>
>
> The nominated mentors are employees of NASA JPL, LinkedIn, and Hortonworks.
>
> * Chris Mattmann - NASA JPL
>
> * Jakob Homan - LinkedIn
>
> * Owen O'Malley - Hortonworks
>
>
> = Sponsors =
>
> == Champion ==
>
> * Jakob Homan <jghoman AT apache DOT org>
>
>
> == Nominated Mentors ==
>
> * Chris Mattmann <chris DOT a DOT mattmann AT jpl DOT nasa DOT gov>
>
> * Jakob Homan <jghoman AT apache DOT org>
>
> * Owen O'Malley <omalley AT apache DOT org>
>
>
> == Sponsoring Entity ==
>
> Apache Incubator
>