+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament <johndam...@apache.org> wrote:
> +1
>
> On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Hi all,
> >
> > following the discussion thread, I'm now calling a vote to accept
> > CarbonData into the Incubator.
> >
> > [ ] +1 Accept CarbonData into the Apache Incubator
> > [ ] +0 Abstain
> > [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> >
> > This vote is open for 72 hours.
> >
> > The proposal follows, you can also access the wiki page:
> > https://wiki.apache.org/incubator/CarbonDataProposal
> >
> > Thanks !
> > Regards
> > JB
> >
> > = Apache CarbonData =
> >
> > == Abstract ==
> >
> > Apache CarbonData is a new Apache Hadoop native file format for faster
> > interactive queries. It uses advanced columnar storage, indexing,
> > compression, and encoding techniques to improve computing efficiency,
> > which in turn helps speed up queries by an order of magnitude over
> > petabytes of data.
> >
> > CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata
> >
> > == Background ==
> >
> > Huawei is an ICT solution provider committed to enhancing customer
> > experiences for telecom carriers, enterprises, and consumers on big
> > data. To satisfy the following customer requirements, we created a new
> > Hadoop native file format:
> >
> > * Support interactive OLAP-style queries over big data in seconds.
> > * Support fast queries on individual records, which require touching
> >   all fields.
> > * Support fast data loading and incremental loads in periods of
> >   minutes.
> > * Support HDFS so that customers can leverage existing Hadoop clusters.
> > * Support time-based data retention.
> >
> > Based on these requirements, we investigated existing file formats in
> > the Hadoop ecosystem, but we could not find a suitable solution that
> > satisfies all of the requirements at the same time, so we started
> > designing CarbonData.
> >
> > == Rationale ==
> >
> > CarbonData contains multiple modules, which are classified into two
> > categories:
> >
> > 1. CarbonData File Format: contains the core implementation of the
> >    file format, such as columnar storage, index, dictionary,
> >    encoding + compression, and the API for reading/writing.
> > 2. CarbonData integration with big data processing frameworks such as
> >    Apache Spark and Apache Hive. Apache Beam is also planned to
> >    abstract the execution runtime.
> >
> > === CarbonData File Format ===
> >
> > The CarbonData file format is a columnar store in HDFS. It has many
> > features that a modern columnar format has, such as splittability,
> > compression schema, and complex data types. CarbonData also has the
> > following unique features:
> >
> > ==== Indexing ====
> >
> > In order to support fast interactive queries, CarbonData leverages
> > indexing technology to reduce I/O scans. CarbonData files store data
> > along with the index; the index is not stored separately but is
> > contained in the CarbonData file itself. In the current
> > implementation, CarbonData supports 3 types of indexing:
> >
> > 1. Multi-dimensional Key (B+ Tree index)
> >    The data blocks are written in sequence to the disk, and within
> >    each data block each column block is written in sequence. Finally,
> >    the metadata block for the file is written with information about
> >    the byte positions of each block in the file, the Min-Max
> >    statistics index, and the start and end MDK (Multi-dimensional
> >    Key) of each data block. Since the entire data in the file is in
> >    sorted order, the start and end MDK of each data block can be used
> >    to construct a B+Tree, and the file can be logically represented
> >    as a B+Tree with the data blocks as leaf nodes (on disk) and the
> >    remaining non-leaf nodes in memory.
> > 2. Inverted index
> >    Inverted indexes are widely used in search engines. This index
> >    helps the processing/query engine filter inside one HDFS block.
> >    Furthermore, query acceleration for count-distinct-like operations
> >    is made possible by combining bitmap and inverted index at query
> >    time.
> > 3. MinMax index
> >    For all columns, a minmax index is created so that the
> >    processing/query engine can skip scans that are not required.
> >
> > ==== Global Dictionary ====
> >
> > Besides I/O reduction, CarbonData accelerates computation by using a
> > global dictionary, which enables processing/query engines to perform
> > all processing on encoded data without having to convert the data
> > (Late Materialization). We have observed dramatic performance
> > improvement for OLAP analytic scenarios where tables contain many
> > columns of string data type. The data is converted back to
> > user-readable form just before the processing/query engine returns
> > results to the user.
> >
> > ==== Column Group ====
> >
> > Sometimes users want to perform processing/queries on multiple columns
> > in one table, for example, scanning for an individual record in a
> > troubleshooting scenario. In this case, row format is more efficient
> > than columnar format since all columns will be touched by the
> > workload. To accelerate this, CarbonData supports storing a group of
> > columns in row format, so data in a column group is stored together,
> > enabling fast retrieval.
> >
> > ==== Optimized for multiple use cases ====
> >
> > CarbonData indices and dictionary are highly configurable. To optimize
> > storage for different use cases, users can configure what to index, so
> > they can decide and tune the format before loading data into
> > CarbonData.
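A side note for readers of the archive: the MinMax index quoted above can be pictured with a small sketch. This is only an illustration of the general idea of skipping blocks by their min/max statistics, not CarbonData code. The `Block` class and the `scan_equals` function are hypothetical names made up for this example, and Python is used purely for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block: one column's values plus min/max statistics."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # Statistics are computed once, when the block is "written".
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually scanned."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # Skip the block when the target cannot fall inside its value range.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.values if v == target)
    return matches, scanned


blocks = [Block([1, 3, 5]), Block([10, 12, 15]), Block([20, 22, 25])]
matches, scanned = scan_equals(blocks, 12)
print(matches, scanned)  # [12] 1
```

Only one of the three blocks is read for the filter `value == 12`; the other two are ruled out from their statistics alone, which is the I/O saving the proposal refers to.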
> >
> > For example:
> >
> > || Use Case || Supporting Features ||
> > || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> > || High throughput scan || Global dictionary, Minmax index ||
> > || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> > || Individual record query || Column group, Global dictionary ||
> >
> > === BigData Processing Framework Integration ===
> >
> > * CarbonData provides InputFormat/OutputFormat interfaces for
> >   reading/writing data from CarbonData files, and at the same time
> >   provides an abstract API for processing data stored in CarbonData
> >   format with a data processing framework.
> > * CarbonData provides deep integration with Apache Spark, including
> >   predicate push down, column pruning, and aggregation push down, so
> >   users can use Spark SQL to connect to and query CarbonData.
> > * CarbonData can integrate with various big data query/processing
> >   frameworks in the Hadoop ecosystem, such as Apache Spark and Apache
> >   Hive.
> >
> > Example:
> >
> > https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> >
> > == Initial Goals ==
> >
> > Our initial goals are to bring CarbonData into the ASF, transition
> > internal engineering processes into the open, and foster a
> > collaborative development model according to the "Apache Way".
> >
> > == Current Status ==
> >
> > CarbonData is production ready and already provides a large set of
> > features. The current license is already Apache 2.0.
> >
> > == Meritocracy ==
> >
> > We intend to radically expand the initial developer and user community
> > by running the project in accordance with the "Apache Way". Users and
> > new contributors will be treated with respect and welcomed.
> > By participating in the community and providing quality
> > patches/support that move the project forward, they will earn merit.
> > They will also be encouraged to provide non-code contributions
> > (documentation, events, community management, etc.) and will gain
> > merit for doing so. Those with a proven support and quality track
> > record will be encouraged to become committers.
> >
> > == Community ==
> >
> > If CarbonData is accepted for incubation, the primary initial goal is
> > to build a large community. We truly believe that CarbonData will
> > become a key project for big data column-like platforms, and so we
> > are betting on a large community of users and developers.
> >
> > == Known Risks ==
> >
> > Development has been sponsored mostly by one company. For the project
> > to fully transition to the Apache Way governance model, development
> > must shift towards the meritocracy-centric model of growing a
> > community of contributors, balanced with the needs for extreme
> > stability and core implementation coherency.
> >
> > == Orphaned products ==
> >
> > Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> > interest in making CarbonData succeed by driving its close integration
> > with sister ASF projects. We expect this to further reduce the risk of
> > orphaning the product.
> >
> > == Inexperience with Open Source ==
> >
> > Huawei has been developing and using open source software for a long
> > time. Additionally, several ASF veterans agreed to mentor the project
> > and are listed in this proposal. The project will rely on their
> > guidance and collective wisdom to quickly transition the entire team
> > of initial committers towards practicing the Apache Way.
> >
> > == Reliance on Salaried Developers ==
> >
> > Most of the contributors are paid to work in the big data space.
> > While they might wander from their current employers, they are
> > unlikely to venture far from their core expertise and thus will
> > continue to be engaged with the project regardless of their current
> > employers.
> >
> > == An Excessive Fascination with the Apache Brand ==
> >
> > While we intend to leverage the Apache ‘branding’ when talking to
> > other projects as a testament to our project’s ‘neutrality’, we have
> > no plans to use the Apache brand in press releases or to post
> > billboards advertising the acceptance of CarbonData into the Apache
> > Incubator.
> >
> > == Initial Source ==
> >
> > https://github.com/HuaweiBigData/carbondata.git
> >
> > == External Dependencies ==
> >
> > All external dependencies are licensed under an Apache 2.0 license or
> > an Apache-compatible license. As we grow the CarbonData community we
> > will configure our build process to require and validate that all
> > contributions and dependencies are licensed under the Apache 2.0
> > license or under an Apache-compatible license.
> >
> > * Apache Spark
> > * Apache Hadoop
> > * Apache Maven
> > * Apache Commons
> > * Apache Log4j
> > * Apache Thrift
> > * Apache Zookeeper
> > * Scala
> > * Snappy
> > * Kettle (Pentaho)
> > * Eigenbase
> > * Fastutil
> > * GSON
> > * Jmockit
> > * Junit
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> > * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> > * comm...@carbondata.incubator.apache.org
> > * d...@carbondata.incubator.apache.org
> > * iss...@carbondata.incubator.apache.org
> >
> > === Git Repository ===
> >
> > * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> >
> > === Issue Tracking ===
> >
> > * JIRA Project CarbonData (CarbonData)
> >
> > === Initial Committers ===
> >
> > * Liang Chenliang
> > * Jean-Baptiste Onofré
> > * Henry Saputra
> > * Uma Maheswara Rao G
> > * Jenny MA
> > * Jacky Likun
> > * Vimal Das Kammath
> > * Jarray Qiuheng
> >
> > === Affiliations ===
> >
> > * Huawei: Liang Chenliang
> > * Talend: Jean-Baptiste Onofré
> > * Ebay: Henry Saputra
> > * Intel: Uma Maheswara Rao G
> >
> > === Sponsors ===
> >
> > === Champion ===
> >
> > * Jean-Baptiste Onofré - Apache Member
> >
> > === Mentors ===
> >
> > * Henry Saputra (eBay)
> > * Jean-Baptiste Onofré (Talend)
> > * Uma Maheswara Rao G (Intel)
> >
> > === Sponsoring Entity ===
> >
> > The Apache Incubator
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org