Great stuff, +1

On Thursday, November 20, 2014, Luke Han <luke...@gmail.com> wrote:

> Following the discussion earlier in the thread:
>
> http://mail-archives.apache.org/mod_mbox/incubator-general/201411.mbox/%3ccakmqrob22+n+r++date33f3pcpyujhfoeaqrms3t-udjwk6...@mail.gmail.com%3e
>
> I would like to call a VOTE for accepting Kylin as a new incubator project.
>
> The proposal is available at:
> https://wiki.apache.org/incubator/KylinProposal
>
> The text of the proposal is also posted below.
>
> Vote is open until 24th November 2014, 23:59:00 UTC
>
> [ ] +1 accept Kylin in the Incubator
> [ ] ±0
> [ ] -1 because...
>
> Thanks
> Luke
>
> Kylin Proposal
> ==============
>
> # Abstract
>
> Kylin is a distributed and scalable OLAP engine built on Hadoop to
> support extremely large datasets.
>
> # Proposal
>
> Kylin is an open source Distributed Analytics Engine that provides
> multi-dimensional analysis (MOLAP) on Hadoop. Kylin is designed to
> accelerate analytics on Hadoop by allowing the use of SQL-compatible
> tools. It provides a SQL interface, supports extremely large datasets,
> and integrates tightly with the Hadoop ecosystem.
>
> ## Overview of Kylin
>
> The Kylin platform has two parts, data processing and interactive
> querying. First, Kylin reads data from its source, Hive, and runs a set
> of tasks, including MapReduce jobs and shell scripts, to pre-calculate
> results for a specified data model, then saves the resulting OLAP cube
> into storage such as HBase. Once these OLAP cubes are ready, a user can
> submit a request from any SQL-based tool or third-party application to
> Kylin’s REST server. The server calls the Query Engine to determine if
> the target dataset already exists. If so, the engine directly accesses
> the target data in the form of a predefined cube, and returns the result
> with sub-second latency. Otherwise, the engine is designed to route
> non-matching queries to whichever SQL-on-Hadoop tool is already
> available on a Hadoop cluster, such as Hive.
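
To make the two-phase flow described in the overview a bit more concrete, here is a toy, in-memory Python sketch. It is only an illustration under my own assumptions and not Kylin's implementation: the rows, the column names, and the `aggregate`, `build_cube`, and `query` helpers are all hypothetical, a plain dict stands in for HBase, and a full scan of the rows stands in for the fallback to Hive. Each pre-aggregated combination of dimensions corresponds to what the proposal later calls a cuboid.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows; every column name here is invented for illustration.
ROWS = [
    {"site": "US", "category": "toys", "seller": "a", "sales": 10},
    {"site": "US", "category": "books", "seller": "b", "sales": 5},
    {"site": "DE", "category": "toys", "seller": "a", "sales": 7},
]
CUBE_DIMS = ("site", "category")  # dimensions chosen for the toy data model


def aggregate(rows, group_by, measure):
    """Sum `measure` per distinct combination of the `group_by` columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in group_by)] += row[measure]
    return dict(totals)


def build_cube(rows, dims, measure):
    """Offline step: pre-aggregate every combination of the cube dimensions."""
    return {
        combo: aggregate(rows, combo, measure)
        for size in range(len(dims) + 1)
        for combo in combinations(dims, size)
    }


def query(cube, rows, dims, group_by, measure):
    """Online step: answer from the cube if it covers the query, else scan."""
    if set(group_by) <= set(dims):
        combo = tuple(d for d in dims if d in group_by)  # canonical order
        return "cube", cube[combo]
    # Stand-in for routing the query to another engine such as Hive.
    return "scan", aggregate(rows, group_by, measure)


CUBE = build_cube(ROWS, CUBE_DIMS, "sales")
```

With two cube dimensions the sketch stores four pre-aggregated groupings (none, `site`, `category`, and both). A call such as `query(CUBE, ROWS, CUBE_DIMS, ("category",), "sales")` is served by a dictionary lookup, while grouping by `seller`, which is not a cube dimension, falls through to the scan path.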
>
> The Kylin platform includes:
>
> - Metadata Manager: Kylin is a metadata-driven application. The Kylin
> Metadata Manager is the key component that manages all metadata stored
> in Kylin, including all cube metadata. All other components rely on the
> Metadata Manager.
>
> - Job Engine: This engine is designed to handle all of the offline
> jobs, including shell script, Java API, and MapReduce jobs. The Job
> Engine manages and coordinates all of the jobs in Kylin to make sure
> each job executes and that failures are handled.
>
> - Storage Engine: This engine manages the underlying storage –
> specifically, the cuboids, which are stored as key-value pairs. The
> Storage Engine uses HBase – the best solution from the Hadoop
> ecosystem for leveraging an existing K-V system. Kylin can also be
> extended to support other K-V systems, such as Redis.
>
> - Query Engine: Once the cube is ready, the Query Engine can receive
> and parse user queries. It then interacts with other components to
> return the results to the user.
>
> - REST Server: The REST Server is the entry point for applications
> developed against Kylin. Applications can submit queries, get results,
> trigger cube build jobs, get metadata, get user privileges, and so on.
>
> - ODBC Driver: To support third-party tools and applications – such as
> Tableau – we have built and open-sourced an ODBC Driver. The goal is
> to make it easy for users to onboard.
>
> # Background
>
> The challenge we face at eBay is that our data volume is becoming
> bigger and bigger while our user base is becoming more diverse. For
> example, our business users and analysts consistently ask for minimal
> latency when visualizing data on Tableau and Excel.
> So, we worked closely with our internal analyst community and outlined
> the product requirements for Kylin:
>
> - Sub-second query latency on billions of rows
> - ANSI SQL availability for those using SQL-compatible tools
> - Full OLAP capability to offer advanced functionality
> - Support for high cardinality and very large dimensions
> - High concurrency for thousands of users
> - Distributed and scale-out architecture for analysis in the TB to PB size
> range
>
> Existing SQL-on-Hadoop solutions commonly need to perform partial or
> full table or file scans to compute the results of queries. The cost
> of these large data scans can make many queries very slow (more than a
> minute). The core idea of MOLAP (multi-dimensional OLAP) is to
> pre-compute data along dimensions of interest and store the resulting
> aggregates as a "cube". MOLAP is much faster but is inflexible. We
> realized that no existing external product met our exact requirements,
> especially in the open source Hadoop community. To meet our emerging
> business needs, we built a platform from scratch to support MOLAP for
> these business requirements, and later to support others, including
> ROLAP. With an excellent development team and several pilot customers,
> we have been able to bring the Kylin platform into production as well
> as open source it.
>
> # Rationale
>
> When data grows to petabyte scale, pre-calculating a query takes a long
> time and requires costly, powerful hardware. However, with the benefit
> of Hadoop’s distributed computing architecture, jobs can leverage
> hundreds or thousands of Hadoop data nodes. There still exists a big
> gap between the growing volume of data and interactive analytics:
>
> - Existing Business Intelligence (OLAP) platforms cannot scale out to
> support fast-growing data.
> - Existing SQL on Hadoop projects are not designed for OLAP use cases;
> joins on huge tables will always take a long time to scan and calculate.
> - No mature OLAP solution exists on Hadoop.
>
> As mentioned in the background, the business requirements triggered by
> an increase in data volume drove eBay to invest in building a solution
> from scratch to offer analytics capability on Hadoop clusters. With
> Hadoop’s power of distributed computing, Kylin can perform
> pre-calculations in parallel and merge the final results, thereby
> significantly reducing the processing time.
>
> To serve queries by the analyst community, Kylin generates cuboids
> with all possible combinations of dimensions, and calculates all
> metrics at different levels. The cuboids are then integrated to form a
> pre-calculated OLAP cube. All cuboids are key-value structured: keys
> are composites formed from combinations of multiple dimensions and
> values are aggregation results for that particular combination of
> dimensions. Kylin uses HBase to store cubes. HBase is useful because
> it supports efficient searches across ranges of data.
>
> # Current Status
>
> ## Meritocracy
>
> Kylin has been deployed in production at eBay and is processing
> extremely large datasets. The platform has demonstrated great
> performance benefits and has proved to be a better way for analysts to
> leverage data on Hadoop, with a more convenient approach using their
> favorite tool.
>
> ## Community
>
> Kylin seeks to develop developer and user communities during incubation.
>
> ## Core Developers
>
> Kylin is currently being designed and developed by six engineers from
> eBay Inc. – Jiang Xu, Luke Han, Yang Li, George Song, Hongbin Ma and
> Xiaodong Duo. In addition, some outside contributors are actively
> contributing to design and development. Among them, Julian Hyde from
> Hortonworks is a very important contributor. All of these core
> developers have deep expertise in Hadoop and the Hadoop ecosystem in
> general.
>
> ## Alignment
>
> The ASF is a natural host for Kylin given that it is already the home
> of Hadoop, Pig, Hive, and other emerging cloud software projects.
> Kylin was designed to offer OLAP capability on Hadoop from the
> beginning in order to solve data access and analysis challenges in
> Hadoop clusters. Kylin complements the existing Hadoop analytics area
> by providing a comprehensive solution based on pre-computed views.
>
> In Kylin, we are leveraging an open-source dynamic data management
> framework called Apache Calcite to parse SQL and plug in our code.
> Apache Calcite, previously called Optiq, was originally authored by
> Julian Hyde and is now an Apache Incubator project.
>
> # Known Risks
>
> ## Orphaned Products
>
> The core developers of the Kylin team plan to work full time on this
> project. There is very little risk of Kylin getting orphaned since at
> least one large company (eBay) is extensively using it in their
> production Hadoop clusters. For example, currently there are 3 use
> cases with more than 12 billion rows and 1000 activity requests per
> day using Kylin in production. Furthermore, since Kylin was open
> sourced at the beginning of October 2014, it has received more than
> 280 stars and been forked nearly 100 times. Kylin has had one major
> release so far and received 5 pull requests from external contributors
> in its first month, which further demonstrates that Kylin is a very
> active project. We plan to extend and diversify this community further
> through Apache.
>
> ## Inexperience with Open Source
>
> The core developers are all active users and followers of open source.
> They are already committers and contributors to the Kylin GitHub
> project. All have been involved with the source code that has been
> released under an open source license, and several of them also have
> experience developing code in an open source environment.
> Though the core set of developers do not have Apache open source
> experience, there are plans to onboard individuals with Apache open
> source experience on to the project.
>
> ## Homogenous Developers
>
> The core developers include developers from eBay, Ctrip and
> Hortonworks. The Apache Incubation process encourages an open and
> diverse meritocratic community. Apache Kylin has the required amount of
> diversity with committers from three different organizations, but is
> also aware that the bulk of the commits come from a single entity. Kylin
> intends to make every possible effort to build a diverse, vibrant and
> involved community and has already received substantial interest from
> various organizations.
>
> ## Reliance on Salaried Developers
>
> eBay invested in Kylin as the OLAP solution on top of Hadoop clusters
> and some of its key engineers are working full time on the project. In
> addition, since there is a growing Big Data need for scalable OLAP
> solutions on Hadoop, we look forward to other Apache developers and
> researchers contributing to the project. Additional contributors,
> including Apache committers, have plans to join this effort shortly.
> Also key to addressing the risk associated with relying on salaried
> developers from a single entity is to increase the diversity of the
> contributors and actively lobby for domain experts in the BI space to
> contribute. Apache Kylin intends to do this. One approach already
> taken is to approach the Apache Drill project to explore possible
> cooperation.
>
> ## Relationships with Other Apache Products
>
> Kylin has a strong relationship with, and dependency on, Apache Hadoop,
> HBase, Hive and Calcite. Being part of Apache’s Incubation community
> could help with closer collaboration among these four projects as well
> as others.
>
> Kylin is likely to have substantial value to Apache Drill due to the
> common use of Calcite as a query optimization engine and the similarity
> between Kylin's approach to cubing and Drill's approach to input
> sources.
>
> ## An Excessive Fascination with the Apache Brand
>
> Kylin is proposing to enter incubation at Apache in order to help
> efforts to diversify the committer base, not so much to capitalize on
> the Apache brand. The Kylin project is in production use already
> inside eBay, but is not expected to be an eBay product for external
> customers. As such, the Kylin project is not seeking to use the Apache
> brand as a marketing tool.
>
> # Documentation
>
> Information about Kylin can be found at
> https://github.com/KylinOLAP/Kylin. The following links provide more
> information about Kylin in open source:
>
> - Kylin web site: http://kylin.io
> - Codebase at GitHub: https://github.com/KylinOLAP/Kylin
> - Issue Tracking: https://github.com/KylinOLAP/Kylin/issues
> - User community: https://groups.google.com/forum/#!forum/kylin-olap
>
> ## Initial Source
>
> Kylin has been under development since 2013 by a team of engineers at
> eBay Inc. It is currently hosted on Github.com under an Apache license
> at https://github.com/KylinOLAP/Kylin
>
> ## External Dependencies
>
> Kylin has the following external dependencies.
>
> * Basic
>
> - JDK 1.6+
> - Apache Maven
> - JUnit
> - DBUnit
> - Log4j
> - Slf4j
> - Apache Commons
> - Google Guava
> - Jackson
>
> * Hadoop
>
> - Apache Hadoop
> - Apache HBase
> - Apache Hive
> - Apache Zookeeper
> - Apache Curator
>
> * Utility
>
> - H2
> - JSCH
>
> * REST Service
>
> - Spring
>
> * Query
>
> - Antlr
> - Apache Calcite (formerly Optiq)
> - Linq4j
>
> * Job
>
> - Quartz
>
> * Web build tool
>
> - NPM
> - Grunt
> - bower
>
> * Web
>
> - Angular JS
> - jQuery
> - Bootstrap
> - D3 JS
> - ACE
>
> ## Cryptography
>
> Kylin will eventually support encryption on the wire.
> This is not one of the initial goals, and we do not expect Kylin to be
> a controlled export item due to the use of encryption. Kylin supports
> but does not require the Kerberos authentication mechanism to access
> secured Hadoop services.
>
> # Required Resources
>
> ## Mailing List
>
> - kylin-private for private PMC discussions (with moderated subscriptions)
> - kylin-dev
> - kylin-commits
>
> ## Subversion Directory
>
> Git is the preferred source control system: git://git.apache.org/Kylin
>
> ## Issue Tracking
>
> JIRA Kylin (KYLIN)
>
> ## Other Resources
>
> The existing code already has unit tests so we will make use of
> existing Apache continuous testing infrastructure. The resulting load
> should not be very large.
>
> # Initial Committers
>
> - Jiang Xu <jiangxu.china at gmail dot com>
> - Luke Han <lukhan at ebay dot com>
> - Yang Li <yangli9 at ebay dot com>
> - George Song <ysong1 at ebay dot com>
> - Hongbin Ma <honma at ebay dot com>
> - Xiaodong Duo <oranjedog at gmail dot com>
> - Julian Hyde <jhyde at apache dot org>
> - Ankur Bansal <abansal at ebay dot com>
>
> ## Affiliations
>
> The initial committers are employees of eBay Inc., Ctrip and
> Hortonworks. The nominated mentors are employees of Hortonworks, MapR
> Technologies and Pivotal.
>
> # Sponsors
>
> ## Champion
>
> - Owen O’Malley <omalley at apache dot org>
> - Ted Dunning <tdunning at apache dot org>
>
> ## Nominated Mentors
>
> - Owen O’Malley <omalley at apache dot org> - Apache IPMC member,
> Co-founder and Senior Architect, Hortonworks
> - Ted Dunning <tdunning at apache dot org> - Apache IPMC member,
> Chief Architect, MapR Technologies
> - Henry Saputra <hsaputra at apache dot org> - Apache IPMC member, Pivotal
> - Jacques Nadeau <jacques at apache dot org> (pending admission to
> IPMC) - Apache Drill PMC Chair, MapR Technologies
>
> # Sponsoring Entity
>
> We are requesting the Incubator to sponsor this project.