+1

On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive queries. It uses advanced columnar storage, indexing,
> compression, and encoding techniques to improve computing efficiency,
> which in turn helps speed up queries by an order of magnitude over
> petabytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big
> data. To satisfy the following customer requirements, we created a new
> Hadoop native file format:
>
> * Support interactive OLAP-style queries over big data in seconds.
> * Support fast queries on individual records, which require touching
>   all fields.
> * Fast data loading, with support for incremental loads at intervals
>   of minutes.
> * Support HDFS so that customers can leverage existing Hadoop clusters.
> * Support time-based data retention.
>
> Based on these requirements, we investigated existing file formats in
> the Hadoop ecosystem, but we could not find a suitable solution that
> satisfies all of the requirements at the same time, so we started
> designing CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
> 1. CarbonData File Format: contains the core implementation of the
>    file format, such as columnar storage, index, dictionary, encoding
>    and compression, and the API for reading/writing.
> 2. CarbonData integration with big data processing frameworks such as
>    Apache Spark, Apache Hive, etc. Apache Beam is also planned to
>    abstract the execution runtime.
>
> === CarbonData File Format ===
>
> The CarbonData file format is a columnar store in HDFS. It has many
> features that a modern columnar format has, such as being splittable,
> compression schemes, complex data types, etc. CarbonData also has the
> following unique features:
>
> ==== Indexing ====
>
> To support fast interactive queries, CarbonData leverages indexing
> technology to reduce I/O scans. A CarbonData file stores data along
> with its index: the index is not stored separately, but is contained
> in the CarbonData file itself. In the current implementation,
> CarbonData supports 3 types of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
>    The data blocks are written in sequence to the disk, and within
>    each data block each column block is written in sequence. Finally,
>    the metadata block for the file is written with information about
>    the byte positions of each block in the file, the Min-Max
>    statistics index, and the start and end MDK (Multi-dimensional Key)
>    of each data block. Since the entire data in the file is in sorted
>    order, the start and end MDK of each data block can be used to
>    construct a B+Tree, and the file can be logically represented as a
>    B+Tree with the data blocks as leaf nodes (on disk) and the
>    remaining non-leaf nodes in memory.
> 2. Inverted index
>    Inverted indexes are widely used in search engines. This index
>    helps the processing/query engine to do filtering inside one HDFS
>    block. Furthermore, count-distinct-like operations can be
>    accelerated by combining bitmaps and the inverted index at query
>    time.
> 3. MinMax index
>    A MinMax index is created for all columns so that the
>    processing/query engine can skip scans that are not required.
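
The MinMax index in item 3 above can be illustrated with a minimal Python sketch of block skipping. This is not taken from the CarbonData code base: the `Block` class, the `scan_equals` function, and the sample values are all hypothetical, and the sketch only shows the general idea of using per-block min/max statistics to avoid reading blocks that cannot match a filter.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block holding one column's values (non-empty)."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # Statistics are computed once, when the block is "written".
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually scanned."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # Skip the block when the target cannot fall inside its value range.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.values if v == target)
    return matches, scanned


# Assumed sample data: three blocks with non-overlapping value ranges.
blocks = [Block([1, 3, 5]), Block([10, 12, 12]), Block([20, 25, 30])]
matches, scanned = scan_equals(blocks, 12)
print(matches, scanned)  # prints: [12, 12] 1
```

With these sample blocks, only the second block is read for the filter `value == 12`; the other two are ruled out from their statistics alone.
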
>
> ==== Global Dictionary ====
>
> Besides I/O reduction, CarbonData accelerates computation by using a
> global dictionary, which enables processing/query engines to perform
> all processing on encoded data without having to convert the data
> (Late Materialization). We have observed dramatic performance
> improvements in OLAP analytic scenarios where a table contains many
> columns of string data type. The data is converted back to
> user-readable form just before the processing/query engine returns
> results to the user.
>
> ==== Column Group ====
>
> Sometimes users want to perform processing/queries on multiple columns
> in one table, for example, scanning an individual record in a
> troubleshooting scenario. In this case, a row format is more efficient
> than a columnar format since all columns will be touched by the
> workload. To accelerate this, CarbonData supports storing a group of
> columns in row format, so that the data in a column group is stored
> together, enabling fast retrieval.
>
> ==== Optimized for multiple use cases ====
>
> CarbonData indices and dictionaries are highly configurable. To
> optimize storage for different use cases, users can configure what to
> index, and so can decide on and tune the format before loading data
> into CarbonData.
>
> For example:
>
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), MinMax index, Inverted index ||
> || High throughput scan || Global dictionary, MinMax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> || Individual record query || Column group, Global dictionary ||
>
> === BigData Processing Framework Integration ===
>
> * CarbonData provides InputFormat/OutputFormat interfaces for
>   reading/writing data from CarbonData files, and at the same time
>   provides an abstract API for processing data stored in the
>   CarbonData format with data processing frameworks.
> * CarbonData provides deep integration with Apache Spark, including
>   predicate push down, column pruning, aggregation push down, etc., so
>   users can use Spark SQL to connect to and query CarbonData.
> * CarbonData can integrate with various big data query/processing
>   frameworks in the Hadoop ecosystem, such as Apache Spark, Apache
>   Hive, etc.
>
> Example:
>
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>
> == Initial Goals ==
>
> Our initial goals are to bring CarbonData into the ASF, transition
> internal engineering processes into the open, and foster a
> collaborative development model according to the "Apache Way".
>
> == Current Status ==
>
> CarbonData is production ready and already provides a large set of
> features. The current license is already Apache 2.0.
>
> == Meritocracy ==
>
> We intend to radically expand the initial developer and user community
> by running the project in accordance with the "Apache Way". Users and
> new contributors will be treated with respect and welcomed. By
> participating in the community and providing quality patches/support
> that move the project forward, they will earn merit. They also will be
> encouraged to provide non-code contributions (documentation, events,
> community management, etc.) and will gain merit for doing so. Those
> with a proven support and quality track record will be encouraged to
> become committers.
>
> == Community ==
>
> If CarbonData is accepted for incubation, the primary initial goal is
> to build a large community. We really trust that CarbonData will
> become a key project for big data column-like platforms, and so we bet
> on a large community of users and developers.
>
> == Known Risks ==
>
> Development has been sponsored mostly by one company. For the project
> to fully transition to the Apache Way governance model, development
> must shift towards the meritocracy-centric model of growing a
> community of contributors, balanced with the needs for extreme
> stability and core implementation coherency.
>
> == Orphaned products ==
>
> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration
> with sister ASF projects. We expect this to further reduce the risk of
> orphaning the product.
>
> == Inexperience with Open Source ==
>
> Huawei has been developing and using open source software for a long
> time. Additionally, several ASF veterans have agreed to mentor the
> project and are listed in this proposal. The project will rely on
> their guidance and collective wisdom to quickly transition the entire
> team of initial committers towards practicing the Apache Way.
>
> == Reliance on Salaried Developers ==
>
> Most of the contributors are paid to work in the big data space. While
> they might wander from their current employers, they are unlikely to
> venture far from their core expertise and thus will continue to be
> engaged with the project regardless of their current employers.
>
> == An Excessive Fascination with the Apache Brand ==
>
> While we intend to leverage the Apache ‘branding’ when talking to
> other projects as a testament to our project’s ‘neutrality’, we have
> no plans to use the Apache brand in press releases or to post
> billboards advertising the acceptance of CarbonData into the Apache
> Incubator.
>
> == Initial Source ==
>
> https://github.com/HuaweiBigData/carbondata.git
>
> == External Dependencies ==
>
> All external dependencies are licensed under an Apache 2.0 license or
> Apache-compatible license.
> As we grow the CarbonData community, we will configure our build
> process to require and validate that all contributions and
> dependencies are licensed under the Apache 2.0 license or an
> Apache-compatible license.
>
> * Apache Spark
> * Apache Hadoop
> * Apache Maven
> * Apache Commons
> * Apache Log4j
> * Apache Thrift
> * Apache ZooKeeper
> * Scala
> * Snappy
> * Kettle (Pentaho)
> * Eigenbase
> * Fastutil
> * GSON
> * JMockit
> * JUnit
>
> == Required Resources ==
>
> === Mailing lists ===
>
> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> * comm...@carbondata.incubator.apache.org
> * d...@carbondata.incubator.apache.org
> * iss...@carbondata.incubator.apache.org
>
> === Git Repository ===
>
> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>
> === Issue Tracking ===
>
> * JIRA Project CarbonData (CarbonData)
>
> === Initial Committers ===
>
> * Liang Chenliang
> * Jean-Baptiste Onofré
> * Henry Saputra
> * Uma Maheswara Rao G
> * Jenny MA
> * Jacky Likun
> * Vimal Das Kammath
> * Jarray Qiuheng
>
> === Affiliations ===
>
> * Huawei: Liang Chenliang
> * Talend: Jean-Baptiste Onofré
> * eBay: Henry Saputra
> * Intel: Uma Maheswara Rao G
>
> === Sponsors ===
>
> === Champion ===
>
> * Jean-Baptiste Onofré - Apache Member
>
> === Mentors ===
>
> * Henry Saputra (eBay)
> * Jean-Baptiste Onofré (Talend)
> * Uma Maheswara Rao G (Intel)
>
> === Sponsoring Entity ===
>
> The Apache Incubator