Re: [VOTE] Accept CarbonData into the Apache Incubator

Jacques Nadeau Wed, 25 May 2016 17:26:24 -0700

+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament <johndam...@apache.org>
wrote:

> +1
>
> On Wed, May 25, 2016 at 4:41 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Hi all,
> >
> > following the discussion thread, I'm now calling a vote to accept
> > CarbonData into the Incubator.
> >
> > [ ] +1 Accept CarbonData into the Apache Incubator
> > [ ] +0 Abstain
> > [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> >
> > This vote is open for 72 hours.
> >
> > The proposal follows; you can also access it on the wiki page:
> > https://wiki.apache.org/incubator/CarbonDataProposal
> >
> > Thanks !
> > Regards
> > JB
> >
> > = Apache CarbonData =
> >
> > == Abstract ==
> >
> > Apache CarbonData is a new Apache Hadoop native file format for faster
> > interactive queries. It uses advanced columnar storage, indexing,
> > compression and encoding techniques to improve computing efficiency,
> > which in turn helps speed up queries by an order of magnitude over
> > petabytes of data.
> >
> > CarbonData github address: https://github.com/HuaweiBigData/carbondata
> >
> > == Background ==
> >
> > Huawei is an ICT solution provider committed to enhancing customer
> > experiences for telecom carriers, enterprises, and consumers on big
> > data. In order to satisfy the following customer requirements, we
> > created a new Hadoop native file format:
> >
> >   * Support interactive OLAP-style queries over big data in seconds.
> >   * Support fast queries on individual records, which require touching
> > all fields.
> >   * Support fast data loading and incremental loads in a period of
> > minutes.
> >   * Support HDFS so that customers can leverage existing Hadoop clusters.
> >   * Support time-based data retention.
> >
> > Based on these requirements, we investigated existing file formats in
> > the Hadoop ecosystem, but we could not find a suitable solution that
> > satisfies all of the requirements at the same time, so we started
> > designing CarbonData.
> >
> > == Rationale ==
> >
> > CarbonData contains multiple modules, which are classified into two
> > categories:
> >
> >   1. CarbonData File Format, which contains the core implementation of
> > the file format: columnar storage, index, dictionary, encoding and
> > compression, APIs for reading/writing, etc.
> >   2. CarbonData integration with big data processing frameworks such as
> > Apache Spark, Apache Hive, etc. Apache Beam is also planned to abstract
> > the execution runtime.
> >
> > === CarbonData File Format ===
> >
> > The CarbonData file format is a columnar store in HDFS. It has many
> > features of a modern columnar format, such as splittability, compression
> > schemes, complex data types, etc. In addition, CarbonData has the
> > following unique features:
> >
> > ==== Indexing ====
> >
> > In order to support fast interactive queries, CarbonData leverages
> > indexing technology to reduce I/O scans. A CarbonData file stores data
> > along with its index: the index is not stored separately, the file
> > itself contains it. In the current implementation, CarbonData supports
> > 3 types of indexing:
> >
> > 1. Multi-dimensional Key (B+ Tree index)
> >   The data blocks are written to disk in sequence, and within each
> > data block each column block is written in sequence. Finally, the
> > metadata block for the file is written with information about the byte
> > positions of each block in the file, the min-max statistics index, and
> > the start and end MDK (Multi-dimensional Key) of each data block. Since
> > all data in the file is in sorted order, the start and end MDK of each
> > data block can be used to construct a B+Tree, and the file can be
> > logically represented as a B+Tree with the data blocks as leaf nodes
> > (on disk) and the remaining non-leaf nodes in memory.
> > 2. Inverted index
> >   Inverted indexes are widely used in search engines. This index helps
> > the processing/query engine filter inside one HDFS block. Furthermore,
> > combining bitmaps with the inverted index at query time makes it
> > possible to accelerate count-distinct-like operations.
> > 3. MinMax index
> >   A min-max index is created for all columns so that the
> > processing/query engine can skip scans that are not required.
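
The min-max skipping described above can be illustrated with a minimal
sketch. It assumes a toy in-memory layout; the names Block, make_block and
scan_equals are hypothetical and are not part of CarbonData's API or its
on-disk format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block: one column's values plus min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def make_block(values: List[int]) -> Block:
    # Statistics are computed once, when the block is written.
    return Block(values=list(values), min_value=min(values), max_value=max(values))


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return (matching values, number of blocks actually scanned)."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # Skip the block when the target cannot fall inside its value range.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.values if v == target)
    return matches, scanned


blocks = [make_block([1, 3, 5]), make_block([10, 12, 12]), make_block([20, 25])]
matches, scanned = scan_equals(blocks, 12)
```

In this toy data set only the second block has a range that can contain 12,
so the other two blocks are never read.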
> >
> > ==== Global Dictionary ====
> >
> > Besides I/O reduction, CarbonData accelerates computation by using a
> > global dictionary, which enables processing/query engines to perform all
> > processing on encoded data without having to convert the data (late
> > materialization). We have observed dramatic performance improvements in
> > OLAP analytic scenarios where tables contain many columns of string data
> > type. The data is converted back to user-readable form just before the
> > processing/query engine returns results to the user.
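
As a rough illustration of processing on encoded data with late
materialization, here is a minimal sketch. The helper names
(build_dictionary, encode) and the sample data are assumptions made for
this example; CarbonData's real dictionary generation is not shown here.

```python
def build_dictionary(values):
    """Assign an integer code to each distinct value, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace each string with its integer code."""
    return [dictionary[v] for v in values]


cities = ["paris", "tokyo", "paris", "lima", "tokyo", "paris"]
dictionary = build_dictionary(cities)
encoded = encode(cities, dictionary)

# Group and count on the integer codes only; no strings are touched here.
counts = {}
for code in encoded:
    counts[code] = counts.get(code, 0) + 1

# Late materialization: decode to strings only when building the final result.
reverse = {code: value for value, code in dictionary.items()}
result = {reverse[code]: n for code, n in counts.items()}
```

The aggregation loop compares and hashes small integers instead of strings,
which is where the saving for string-heavy tables would come from.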
> >
> > ==== Column Group ====
> >
> > Sometimes users want to process or query multiple columns of one table
> > together, for example when scanning for an individual record in a
> > troubleshooting scenario. In this case, a row format is more efficient
> > than a columnar format since all columns will be touched by the
> > workload. To accelerate this, CarbonData supports storing a group of
> > columns in row format, so the data in a column group is stored together
> > and can be retrieved quickly.
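
The column group idea can be sketched as a hybrid layout: most columns are
stored column-wise, while the grouped columns are stored row-wise. This is
only an in-memory illustration; to_layout and the sample records are
hypothetical and do not reflect CarbonData's actual storage format.

```python
rows = [
    {"id": 1, "host": "a", "status": 200, "bytes": 512},
    {"id": 2, "host": "b", "status": 404, "bytes": 128},
    {"id": 3, "host": "a", "status": 200, "bytes": 64},
]


def to_layout(rows, column_group):
    """Split rows into plain columnar storage and one row-wise column group."""
    columnar = {}
    for name in rows[0]:
        if name not in column_group:
            # Ordinary column: all values of this column stored contiguously.
            columnar[name] = [r[name] for r in rows]
    # Column group: the grouped columns of each record stored together.
    grouped = [tuple(r[name] for name in column_group) for r in rows]
    return columnar, grouped


columnar, grouped = to_layout(rows, ("host", "status"))

# Record lookup reads one tuple instead of one value from each column.
record = grouped[1]

# Analytic scans still read a single contiguous column.
total_bytes = sum(columnar["bytes"])
```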
> >
> > ==== Optimized for multiple use cases ====
> >
> > CarbonData indices and dictionaries are highly configurable. To
> > optimize storage for different use cases, users can configure what to
> > index, so they can decide on and tune the format before loading data
> > into CarbonData.
> >
> > For example:
> >
> > || Use Case || Supporting Features ||
> > || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> > || High throughput scan || Global dictionary, Minmax index ||
> > || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> > || Individual record query || Column group, Global dictionary ||
> >
> > === BigData Processing Framework Integration ===
> >
> >   * CarbonData provides InputFormat/OutputFormat interfaces for
> > reading/writing data from CarbonData files, and at the same time
> > provides an abstract API for processing data stored in CarbonData format
> > with data processing frameworks.
> >   * CarbonData provides deep integration with Apache Spark, including
> > predicate push down, column pruning, aggregation push down, etc., so
> > users can use Spark SQL to connect to and query CarbonData.
> >   * CarbonData can integrate with various big data query/processing
> > frameworks in the Hadoop ecosystem, such as Apache Spark, Apache Hive,
> > etc.
> >
> > Example:
> >
> >
> > https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> >
> > == Initial Goals ==
> >
> > Our initial goals are to bring CarbonData into the ASF, transition
> > internal engineering processes into the open, and foster a collaborative
> > development model according to the "Apache Way".
> >
> > == Current Status ==
> >
> > CarbonData is production ready and already provides a large set of
> > features.
> > The current license is already Apache 2.0.
> >
> > == Meritocracy ==
> >
> > We intend to radically expand the initial developer and user community
> > by running the project in accordance with the "Apache Way". Users and
> > new contributors will be treated with respect and welcomed. By
> > participating in the community and providing quality patches/support
> > that move the project forward, they will earn merit. They also will be
> > encouraged to provide non-code contributions (documentation, events,
> > community management, etc.) and will gain merit for doing so. Those with
> > a proven support and quality track record will be encouraged to become
> > committers.
> >
> > == Community ==
> >
> > If CarbonData is accepted for incubation, the primary initial goal is to
> > build a large community. We firmly believe that CarbonData will become a
> > key project for big data column-like platforms, and so we are betting on
> > a large community of users and developers.
> >
> > == Known Risks ==
> >
> > Development has been sponsored mostly by one company. For the project
> > to fully transition to the Apache Way governance model, development must
> > shift towards the meritocracy-centric model of growing a community of
> > contributors balanced with the needs for extreme stability and core
> > implementation coherency.
> >
> > == Orphaned products ==
> >
> > Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> > interest in making CarbonData succeed by driving its close integration
> > with sister ASF projects. We expect this to further reduce the risk of
> > orphaning the product.
> >
> > == Inexperience with Open Source ==
> >
> > Huawei has been developing and using open source software for a long
> > time. Additionally, several ASF veterans have agreed to mentor the project
> > and are listed in this proposal. The project will rely on their guidance
> > and collective wisdom to quickly transition the entire team of initial
> > committers towards practicing the Apache Way.
> >
> > == Reliance on Salaried Developers ==
> >
> > Most of the contributors are paid to work in the big data space. While
> > they might wander from their current employers, they are unlikely to
> > venture far from their core expertise and thus will continue to be
> > engaged with the project regardless of their current employers.
> >
> > == An Excessive Fascination with the Apache Brand ==
> >
> > While we intend to leverage the Apache ‘branding’ when talking to other
> > projects as a testament to our project’s ‘neutrality’, we have no plans
> > to use the Apache brand in press releases or to post billboards
> > advertising the acceptance of CarbonData into the Apache Incubator.
> >
> > == Initial Source ==
> >
> > https://github.com/HuaweiBigData/carbondata.git
> >
> > == External Dependencies ==
> >
> > All external dependencies are licensed under an Apache 2.0 license or
> > Apache-compatible license. As we grow the CarbonData community we will
> > configure our build process to require and validate that all
> > contributions and dependencies are licensed under the Apache 2.0 license
> > or an Apache-compatible license.
> >
> >   * Apache Spark
> >   * Apache Hadoop
> >   * Apache Maven
> >   * Apache Commons
> >   * Apache Log4j
> >   * Apache Thrift
> >   * Apache Zookeeper
> >   * Scala
> >   * Snappy
> >   * Kettle (Pentaho)
> >   * Eigenbase
> >   * Fastutil
> >   * GSON
> >   * Jmockit
> >   * Junit
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> >   * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> >   * comm...@carbondata.incubator.apache.org
> >   * d...@carbondata.incubator.apache.org
> >   * iss...@carbondata.incubator.apache.org
> >
> > === Git Repository ===
> >
> >   * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> >
> > === Issue Tracking ===
> >
> >   * JIRA Project CarbonData (CarbonData)
> >
> > === Initial Committers ===
> >
> >   * Liang Chenliang
> >   * Jean-Baptiste Onofré
> >   * Henry Saputra
> >   * Uma Maheswara Rao G
> >   * Jenny MA
> >   * Jacky Likun
> >   * Vimal Das Kammath
> >   * Jarray Qiuheng
> >
> > === Affiliations ===
> >
> >   * Huawei: Liang Chenliang
> >   * Talend: Jean-Baptiste Onofré
> >   * eBay: Henry Saputra
> >   * Intel: Uma Maheswara Rao G
> >
> > === Sponsors ===
> >
> > === Champion ===
> >
> >   * Jean-Baptiste Onofré - Apache Member
> >
> > === Mentors ===
> >
> >   * Henry Saputra (eBay)
> >   * Jean-Baptiste Onofré (Talend)
> >   * Uma Maheswara Rao G (Intel)
> >
> > === Sponsoring Entity ===
> >
> > The Apache Incubator
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
> >
> >
>
