+1

-----Original Message-----
From: Jacques Nadeau [mailto:jacq...@apache.org]
Sent: Thursday, May 26, 2016 8:26 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament <johndam...@apache.org> wrote:

> +1
>
> On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Hi all,
> >
> > following the discussion thread, I'm now calling a vote to accept
> > CarbonData into the Incubator.
> >
> > [ ] +1 Accept CarbonData into the Apache Incubator
> > [ ] +0 Abstain
> > [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> >
> > This vote is open for 72 hours.
> >
> > The proposal follows, you can also access the wiki page:
> > https://wiki.apache.org/incubator/CarbonDataProposal
> >
> > Thanks !
> > Regards
> > JB
> >
> > = Apache CarbonData =
> >
> > == Abstract ==
> >
> > Apache CarbonData is a new Apache Hadoop native file format for
> > faster interactive queries. It uses advanced columnar storage, index,
> > compression and encoding techniques to improve computing efficiency,
> > which in turn will help speed up queries by an order of magnitude
> > over petabytes of data.
> >
> > CarbonData github address:
> > https://github.com/HuaweiBigData/carbondata
> >
> > == Background ==
> >
> > Huawei is an ICT solution provider committed to enhancing customer
> > experiences for telecom carriers, enterprises, and consumers on big
> > data. To satisfy the following customer requirements, we created a
> > new Hadoop native file format:
> >
> >  * Support interactive OLAP-style queries over big data in seconds.
> >  * Support fast queries on individual records, which require touching
> >    all fields.
> >  * Support fast data loading and incremental loads within a period
> >    of minutes.
> >  * Support HDFS so that customers can leverage existing Hadoop clusters.
> >  * Support time-based data retention.
> >
> > Based on these requirements, we investigated existing file formats
> > in the Hadoop ecosystem, but we could not find a suitable solution
> > that satisfies all of the requirements at the same time, so we
> > started designing CarbonData.
> >
> > == Rationale ==
> >
> > CarbonData contains multiple modules, which are classified into two
> > categories:
> >
> >  1. CarbonData File Format, which contains the core implementation
> >     of the file format: columnar storage, index, dictionary, encoding
> >     and compression, the API for reading/writing, etc.
> >  2. CarbonData integration with big data processing frameworks such
> >     as Apache Spark, Apache Hive, etc. Apache Beam is also planned to
> >     abstract the execution runtime.
> >
> > === CarbonData File Format ===
> >
> > The CarbonData file format is a columnar store in HDFS. It has many
> > features of a modern columnar format, such as splittable files,
> > compression schemes, and complex data types. CarbonData also has the
> > following unique features:
> >
> > ==== Indexing ====
> >
> > In order to support fast interactive queries, CarbonData leverages
> > indexing technology to reduce I/O scans. CarbonData files store
> > data along with the index: the index is not stored separately, but
> > inside the CarbonData file itself. In the current implementation,
> > CarbonData supports 3 types of indexing:
> >
> >  1. Multi-dimensional Key (B+ Tree index)
> >     The data blocks are written in sequence to the disk, and within
> >     each data block each column block is written in sequence. Finally,
> >     the metadata block for the file is written with information about
> >     the byte positions of each block in the file, the Min-Max
> >     statistics index, and the start and end MDK of each data block.
> >     Since the entire data in the file is in sorted order, the start
> >     and end MDK of each data block can be used to construct a B+Tree,
> >     and the file can be logically represented as a B+Tree with the
> >     data blocks as leaf nodes (on disk) and the remaining non-leaf
> >     nodes in memory.
> >  2. Inverted index
> >     The inverted index is widely used in search engines. Using this
> >     index helps the processing/query engine do filtering inside one
> >     HDFS block.
> >     Furthermore, count-distinct-like operations can be accelerated
> >     by combining bitmap and inverted index at query time.
> >  3. MinMax index
> >     For all columns, a minmax index is created so that the
> >     processing/query engine can skip scans that are not required.
> >
> > ==== Global Dictionary ====
> >
> > Besides I/O reduction, CarbonData accelerates computation by using a
> > global dictionary, which enables processing/query engines to perform
> > all processing on encoded data without having to convert the data
> > (Late Materialization). We have observed dramatic performance
> > improvement for OLAP analytic scenarios where the table contains many
> > columns of string data type. The data is converted back to
> > user-readable form just before the processing/query engine returns
> > results to the user.
> >
> > ==== Column Group ====
> >
> > Sometimes users want to perform processing/queries on multiple
> > columns in one table, for example, scanning individual records in a
> > troubleshooting scenario. In this case, a row format is more
> > efficient than a columnar format, since all columns will be touched
> > by the workload. To accelerate this, CarbonData supports storing a
> > group of columns in row format, so data in a column group is stored
> > together, enabling fast retrieval.
> >
> > ==== Optimized for multiple use cases ====
> >
> > CarbonData indices and dictionary are highly configurable. To
> > optimize storage for different use cases, the user can configure
> > what to index, and so can decide on and tune the format before
> > loading data into CarbonData.
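
To make the global dictionary and MinMax index described in the proposal more concrete, here is a minimal Python sketch. It is not CarbonData code and says nothing about the actual file layout: the names `build_dictionary`, `Block`, `make_blocks` and `count_equal` are hypothetical. It assumes the column values are already sorted (the proposal states that data within a file is in sorted order) and, as a further assumption, that dictionary codes are assigned in sorted value order so that min/max comparisons on codes are meaningful.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def build_dictionary(values: List[str]) -> Dict[str, int]:
    """Map each distinct string to a small integer code.

    Assumption: codes follow sorted value order, so comparing codes
    is equivalent to comparing the original strings.
    """
    return {value: code for code, value in enumerate(sorted(set(values)))}


@dataclass
class Block:
    codes: List[int]  # dictionary-encoded column values in this block
    min_code: int     # min-max statistics kept alongside the data
    max_code: int


def make_blocks(values: List[str], dictionary: Dict[str, int],
                block_size: int) -> List[Block]:
    """Split an encoded column into fixed-size blocks with min-max stats."""
    blocks = []
    for start in range(0, len(values), block_size):
        codes = [dictionary[v] for v in values[start:start + block_size]]
        blocks.append(Block(codes, min(codes), max(codes)))
    return blocks


def count_equal(blocks: List[Block], dictionary: Dict[str, int],
                target: str) -> Tuple[int, int]:
    """Count rows equal to `target`; return (matches, blocks scanned).

    The filter runs on integer codes only; strings are never decoded.
    """
    code = dictionary.get(target)
    if code is None:
        # Value is not in the dictionary, so no block can contain it.
        return 0, 0
    matches = scanned = 0
    for block in blocks:
        if code < block.min_code or code > block.max_code:
            continue  # min-max index proves the block has no match
        scanned += 1
        matches += sum(1 for c in block.codes if c == code)
    return matches, scanned


cities = ["beijing", "beijing", "berlin", "berlin",
          "paris", "paris", "tokyo", "tokyo"]
dictionary = build_dictionary(cities)
blocks = make_blocks(cities, dictionary, block_size=2)
print(count_equal(blocks, dictionary, "paris"))  # (2, 1)
```

In this toy run only one of the four blocks is scanned, which is the I/O saving the MinMax index is meant to provide; the comparison on integer codes instead of strings illustrates the idea behind processing on encoded data.
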
> >
> > For example:
> >
> > || Use Case || Supporting Features ||
> > || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> > || High throughput scan || Global dictionary, Minmax index ||
> > || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> > || Individual record query || Column group, Global dictionary ||
> >
> > === BigData Processing Framework Integration ===
> >
> >  * CarbonData provides InputFormat/OutputFormat interfaces for
> >    reading/writing data from/to CarbonData files, and at the same
> >    time provides an abstract API for processing data stored in
> >    CarbonData format with data processing frameworks.
> >  * CarbonData provides deep integration with Apache Spark, including
> >    predicate push down, column pruning, aggregation push down, etc.,
> >    so users can use Spark SQL to connect to and query CarbonData.
> >  * CarbonData can integrate with various big data query/processing
> >    frameworks in the Hadoop ecosystem, such as Apache Spark, Apache
> >    Hive, etc.
> >
> > Example:
> >
> > https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> >
> > == Initial Goals ==
> >
> > Our initial goals are to bring CarbonData into the ASF, transition
> > internal engineering processes into the open, and foster a
> > collaborative development model according to the "Apache Way".
> >
> > == Current Status ==
> >
> > CarbonData is production ready and already provides a large set of
> > features. The current license is already Apache 2.0.
> >
> > == Meritocracy ==
> >
> > We intend to radically expand the initial developer and user
> > community by running the project in accordance with the "Apache
> > Way". Users and new contributors will be treated with respect and
> > welcomed.
> > By participating in the community and providing quality
> > patches/support that move the project forward, they will earn merit.
> > They also will be encouraged to provide non-code contributions
> > (documentation, events, community management, etc.) and will gain
> > merit for doing so. Those with a proven support and quality track
> > record will be encouraged to become committers.
> >
> > == Community ==
> >
> > If CarbonData is accepted for incubation, the primary initial goal
> > is to build a large community. We really trust that CarbonData will
> > become a key project for big data column-like platforms, and so we
> > bet on a large community of users and developers.
> >
> > == Known Risks ==
> >
> > Development has been sponsored mostly by a single company. For the
> > project to fully transition to the Apache Way governance model,
> > development must shift towards the meritocracy-centric model of
> > growing a community of contributors balanced with the needs for
> > extreme stability and core implementation coherency.
> >
> > == Orphaned products ==
> >
> > Huawei is fully committed to CarbonData. Moreover, Huawei has a
> > vested interest in making CarbonData succeed by driving its close
> > integration with sister ASF projects. We expect this to further
> > reduce the risk of orphaning the product.
> >
> > == Inexperience with Open Source ==
> >
> > Huawei has been developing and using open source software for a
> > long time. Additionally, several ASF veterans agreed to mentor the
> > project and are listed in this proposal. The project will rely on
> > their guidance and collective wisdom to quickly transition the
> > entire team of initial committers towards practicing the Apache Way.
> >
> > == Reliance on Salaried Developers ==
> >
> > Most of the contributors are paid to work in the big data space.
> > While they might wander from their current employers, they are
> > unlikely to venture far from their core expertise and thus will
> > continue to be engaged with the project regardless of their current
> > employers.
> >
> > == An Excessive Fascination with the Apache Brand ==
> >
> > While we intend to leverage the Apache ‘branding’ when talking to
> > other projects as a testament to our project’s ‘neutrality’, we have
> > no plans for making use of the Apache brand in press releases nor
> > for posting billboards advertising acceptance of CarbonData into the
> > Apache Incubator.
> >
> > == Initial Source ==
> >
> > https://github.com/HuaweiBigData/carbondata.git
> >
> > == External Dependencies ==
> >
> > All external dependencies are licensed under an Apache 2.0 license
> > or Apache-compatible license. As we grow the CarbonData community we
> > will configure our build process to require and validate that all
> > contributions and dependencies are licensed under the Apache 2.0
> > license or an Apache-compatible license.
> >
> >  * Apache Spark
> >  * Apache Hadoop
> >  * Apache Maven
> >  * Apache Commons
> >  * Apache Log4j
> >  * Apache Thrift
> >  * Apache ZooKeeper
> >  * Scala
> >  * Snappy
> >  * Kettle (Pentaho)
> >  * Eigenbase
> >  * Fastutil
> >  * GSON
> >  * JMockit
> >  * JUnit
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> >  * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> >  * comm...@carbondata.incubator.apache.org
> >  * d...@carbondata.incubator.apache.org
> >  * iss...@carbondata.incubator.apache.org
> >
> > === Git Repository ===
> >
> >  * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> >
> > === Issue Tracking ===
> >
> >  * JIRA Project CarbonData (CarbonData)
> >
> > === Initial Committers ===
> >
> >  * Liang Chenliang
> >  * Jean-Baptiste Onofré
> >  * Henry Saputra
> >  * Uma Maheswara Rao G
> >  * Jenny MA
> >  * Jacky Likun
> >  * Vimal Das Kammath
> >  * Jarray Qiuheng
> >
> > === Affiliations ===
> >
> >  * Huawei: Liang Chenliang
> >  * Talend: Jean-Baptiste Onofré
> >  * eBay: Henry Saputra
> >  * Intel: Uma Maheswara Rao G
> >
> > === Sponsors ===
> >
> > === Champion ===
> >
> >  * Jean-Baptiste Onofré - Apache Member
> >
> > === Mentors ===
> >
> >  * Henry Saputra (eBay)
> >  * Jean-Baptiste Onofré (Talend)
> >  * Uma Maheswara Rao G (Intel)
> >
> > === Sponsoring Entity ===
> >
> > The Apache Incubator
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org