+1 (binding)

On Wed, May 25, 2016 at 10:24 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive queries. It uses advanced columnar storage, indexing,
> compression and encoding techniques to improve computing efficiency,
> which in turn helps speed up queries by an order of magnitude over
> petabytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big data.
> In order to satisfy the following customer requirements, we created a new
> Hadoop native file format:
>
> * Support interactive OLAP-style queries over big data in seconds.
> * Support fast queries on individual records, which require touching all
>   fields.
> * Fast data loading, with support for incremental loads at intervals of
>   minutes.
> * Support HDFS so that customers can leverage existing Hadoop clusters.
> * Support time-based data retention.
>
> Based on these requirements, we investigated existing file formats in the
> Hadoop ecosystem, but could not find a suitable solution that satisfies
> all of the requirements at the same time, so we started designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
> 1. CarbonData File Format: contains the core implementation of the file
>    format, such as columnar storage, index, dictionary, encoding and
>    compression, and the API for reading/writing.
> 2. CarbonData integration with big data processing frameworks such as
>    Apache Spark, Apache Hive etc. Apache Beam integration is also planned
>    to abstract the execution runtime.
>
> === CarbonData File Format ===
>
> The CarbonData file format is a columnar store in HDFS. It has many of the
> features of a modern columnar format, such as being splittable,
> compression schema, complex data types etc. In addition, CarbonData has
> the following unique features:
>
> ==== Indexing ====
>
> In order to support fast interactive queries, CarbonData leverages
> indexing technology to reduce I/O scans. CarbonData files store data along
> with the index: the index is not stored separately, the CarbonData file
> itself contains it. In the current implementation, CarbonData supports
> 3 types of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
>    The data blocks are written in sequence to the disk, and within each
>    data block each column block is written in sequence. Finally, the
>    metadata block for the file is written with information about the byte
>    positions of each block in the file, the Min-Max statistics index, and
>    the start and end MDK (multi-dimensional key) of each data block. Since
>    the entire data in the file is in sorted order, the start and end MDK
>    of each data block can be used to construct a B+Tree, and the file can
>    be logically represented as a B+Tree with the data blocks as leaf nodes
>    (on disk) and the remaining non-leaf nodes in memory.
> 2. Inverted index
>    Inverted indexes are widely used in search engines. This index helps
>    the processing/query engine do filtering inside one HDFS block.
>    Furthermore, operations such as count distinct can be accelerated by
>    combining bitmap and inverted index at query time.
> 3. MinMax index
>    For all columns, a minmax index is created so that the processing/query
>    engine can skip scans that are not required.
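
To make the MinMax idea in the quoted proposal concrete, here is a minimal sketch in Python. It is not CarbonData code: the `Block` class, the `scan_equals` function and the sample values are all hypothetical, and the only assumption taken from the proposal is that per-block min/max statistics let a reader skip blocks that cannot match a filter.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block holding one column's values plus min/max stats."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # Statistics are computed once, at write time, and kept with the block.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually scanned."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # If target lies outside [min, max], the block cannot contain it.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.values if v == target)
    return matches, scanned


blocks = [Block([1, 3, 5]), Block([10, 12, 12]), Block([20, 25])]
matches, scanned = scan_equals(blocks, 12)
print(matches, scanned)  # [12, 12] 1
```

Only one of the three blocks is read for the filter `== 12`. A value such as 4 still forces a scan of the first block, since min/max can rule blocks out but cannot confirm a match.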
>
> ==== Global Dictionary ====
>
> Besides I/O reduction, CarbonData accelerates computation by using a
> global dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvements for
> OLAP analytic scenarios where tables contain many columns of string data
> type. The data is converted back to user-readable form just before the
> processing/query engine returns results to the user.
>
> ==== Column Group ====
>
> Sometimes users want to perform processing/queries on multiple columns in
> one table, for example, scanning individual records in a troubleshooting
> scenario. In this case, a row format is more efficient than a columnar
> format since all columns will be touched by the workload. To accelerate
> this, CarbonData supports storing a group of columns in row format, so
> data in a column group is stored together, enabling fast retrieval.
>
> ==== Optimized for multiple use cases ====
>
> CarbonData indices and dictionary are highly configurable. To make storage
> optimized for different use cases, users can configure what to index, so
> they can decide on and tune the format before loading data into
> CarbonData.
>
> For example:
>
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> || High throughput scan || Global dictionary, Minmax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> || Individual record query || Column group, Global dictionary ||
>
> === BigData Processing Framework Integration ===
>
> * CarbonData provides InputFormat/OutputFormat interfaces for
>   reading/writing data from CarbonData files, and at the same time
>   provides an abstract API for processing data stored in CarbonData
>   format with data processing frameworks.
> * CarbonData provides deep integration with Apache Spark, including
>   predicate push down, column pruning, aggregation push down etc., so
>   users can use Spark SQL to connect to and query CarbonData.
> * CarbonData can integrate with various big data query/processing
>   frameworks in the Hadoop ecosystem, such as Apache Spark, Apache Hive
>   etc.
>
> Example:
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>
> == Initial Goals ==
>
> Our initial goals are to bring CarbonData into the ASF, transition
> internal engineering processes into the open, and foster a collaborative
> development model according to the "Apache Way".
>
> == Current Status ==
>
> CarbonData is production ready and already provides a large set of
> features. The current license is already Apache 2.0.
>
> == Meritocracy ==
>
> We intend to radically expand the initial developer and user community by
> running the project in accordance with the "Apache Way". Users and new
> contributors will be treated with respect and welcomed. By participating
> in the community and providing quality patches/support that move the
> project forward, they will earn merit. They will also be encouraged to
> provide non-code contributions (documentation, events, community
> management, etc.) and will gain merit for doing so. Those with a proven
> support and quality track record will be encouraged to become committers.
>
> == Community ==
>
> If CarbonData is accepted for incubation, the primary initial goal is to
> build a large community. We truly believe that CarbonData will become a
> key project for big data column-like platforms, and so we are betting on
> a large community of users and developers.
>
> == Known Risks ==
>
> Development has been sponsored mostly by one company. For the project to
> fully transition to the Apache Way governance model, development must
> shift towards the meritocracy-centric model of growing a community of
> contributors, balanced with the needs for extreme stability and core
> implementation coherency.
>
> == Orphaned products ==
>
> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration
> with sister ASF projects. We expect this to further reduce the risk of
> orphaning the product.
>
> == Inexperience with Open Source ==
>
> Huawei has been developing and using open source software for a long
> time. Additionally, several ASF veterans agreed to mentor the project and
> are listed in this proposal. The project will rely on their guidance and
> collective wisdom to quickly transition the entire team of initial
> committers towards practicing the Apache Way.
>
> == Reliance on Salaried Developers ==
>
> Most of the contributors are paid to work in the big data space. While
> they might wander from their current employers, they are unlikely to
> venture far from their core expertise and thus will continue to be
> engaged with the project regardless of their current employers.
>
> == An Excessive Fascination with the Apache Brand ==
>
> While we intend to leverage the Apache ‘branding’ when talking to other
> projects as a testament to our project’s ‘neutrality’, we have no plans
> for making use of the Apache brand in press releases nor posting
> billboards advertising acceptance of CarbonData into the Apache
> Incubator.
>
> == Initial Source ==
>
> https://github.com/HuaweiBigData/carbondata.git
>
> == External Dependencies ==
>
> All external dependencies are licensed under an Apache 2.0 license or
> Apache-compatible license. As we grow the CarbonData community we will
> configure our build process to require and validate that all
> contributions and dependencies are licensed under the Apache 2.0 license
> or are under an Apache-compatible license.
>
> * Apache Spark
> * Apache Hadoop
> * Apache Maven
> * Apache Commons
> * Apache Log4j
> * Apache Thrift
> * Apache Zookeeper
> * Scala
> * Snappy
> * Kettle (Pentaho)
> * Eigenbase
> * Fastutil
> * GSON
> * Jmockit
> * Junit
>
> == Required Resources ==
>
> === Mailing lists ===
>
> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> * comm...@carbondata.incubator.apache.org
> * d...@carbondata.incubator.apache.org
> * iss...@carbondata.incubator.apache.org
>
> === Git Repository ===
>
> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>
> === Issue Tracking ===
>
> * JIRA Project CarbonData (CarbonData)
>
> === Initial Committers ===
>
> * Liang Chenliang
> * Jean-Baptiste Onofré
> * Henry Saputra
> * Uma Maheswara Rao G
> * Jenny MA
> * Jacky Likun
> * Vimal Das Kammath
> * Jarray Qiuheng
>
> === Affiliations ===
>
> * Huawei: Liang Chenliang
> * Talend: Jean-Baptiste Onofré
> * eBay: Henry Saputra
> * Intel: Uma Maheswara Rao G
>
> === Sponsors ===
>
> === Champion ===
>
> * Jean-Baptiste Onofré - Apache Member
>
> === Mentors ===
>
> * Henry Saputra (eBay)
> * Jean-Baptiste Onofré (Talend)
> * Uma Maheswara Rao G (Intel)
>
> === Sponsoring Entity ===
>
> The Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co