Re: [VOTE] Accept CarbonData into the Apache Incubator

Jim Jagielski Thu, 26 May 2016 08:46:26 -0700

I am trying to align the list of initial committers with
the list of current/active contributors, according to
Github, and I am seeing people proposed who have not
contributed anything and people NOT proposed who seem
to be kinda active...


Sooo..... -0

> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> 
> Hi all,
> 
> following the discussion thread, I'm now calling a vote to accept CarbonData 
> into the Incubator.
> 
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> 
> This vote is open for 72 hours.
> 
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
> 
> Thanks !
> Regards
> JB
> 
> = Apache CarbonData =
> 
> == Abstract ==
> 
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive queries. It uses advanced columnar storage, indexing,
> compression and encoding techniques to improve computing efficiency, which
> in turn helps speed up queries by an order of magnitude over petabytes of
> data.
> 
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
> 
> == Background ==
> 
> Huawei is an ICT solution provider committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big data.
> To satisfy the following customer requirements, we created a new Hadoop
> native file format:
> 
> * Support interactive OLAP-style queries over big data in seconds.
> * Support fast queries on individual records, which require touching all
> fields.
> * Support fast data loading and incremental loads within minutes.
> * Support HDFS so that customers can leverage existing Hadoop clusters.
> * Support time-based data retention.
> 
> Based on these requirements, we investigated existing file formats in the
> Hadoop ecosystem, but could not find a solution that satisfies all of these
> requirements at the same time, so we started designing CarbonData.
> 
> == Rationale ==
> 
> CarbonData contains multiple modules, which are classified into two 
> categories:
> 
> 1. CarbonData File Format: contains the core implementation of the file
> format, such as columnar storage, index, dictionary, encoding and
> compression, and the API for reading/writing.
> 2. CarbonData integration with big data processing frameworks such as Apache
> Spark and Apache Hive. Apache Beam is also planned, to abstract the
> execution runtime.
> 
> === CarbonData File Format ===
> 
> The CarbonData file format is a columnar store in HDFS. It has many of the
> features of a modern columnar format, such as being splittable, a
> compression schema, and complex data types. CarbonData also has the
> following unique features:
> 
> ==== Indexing ====
> 
> To support fast interactive queries, CarbonData uses indexing to reduce I/O
> scans. A CarbonData file stores the index along with the data: the index is
> not stored separately, but inside the CarbonData file itself. In the current
> implementation, CarbonData supports 3 types of indexing:
> 
> 1. Multi-dimensional Key (B+ Tree index)
> Data blocks are written to disk in sequence, and within each data block
> each column block is written in sequence. Finally, the metadata block for
> the file is written, with the byte position of each block in the file, the
> Min-Max statistics index, and the start and end MDK (Multi-dimensional Key)
> of each data block. Since all data in the file is in sorted order, the start
> and end MDK of each data block can be used to construct a B+ Tree, and the
> file can be logically represented as a B+ Tree with the data blocks as leaf
> nodes (on disk) and the remaining non-leaf nodes in memory.
> 2. Inverted index
> Inverted indexes are widely used in search engines. This index helps the
> processing/query engine filter inside one HDFS block. Furthermore, combining
> bitmaps with the inverted index at query time makes it possible to
> accelerate operations such as count distinct.
> 3. MinMax index
> A MinMax index is created for all columns so that the processing/query
> engine can skip scans that are not required.
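Illustration of the MinMax index idea above: a minimal sketch, not CarbonData's
actual implementation. The `Block` class and `scan_equal` function are
hypothetical names invented for this example; the only assumption taken from
the proposal is that each block carries min/max statistics that let a scan
skip blocks which cannot contain the requested value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block: the values of one column in that block."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def scan_equal(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return (matching values, number of blocks actually scanned)."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # Min-max pruning: skip the block if target cannot be inside it.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.values if v == target)
    return matches, scanned


blocks = [Block([1, 3, 5]), Block([6, 8, 9]), Block([10, 12, 15])]
result, scanned = scan_equal(blocks, 8)
print(result, scanned)
```

In this toy data only the middle block overlaps the value 8, so the other two
blocks are never read.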
> 
> ==== Global Dictionary ====
> 
> Besides I/O reduction, CarbonData accelerates computation by using a global
> dictionary, which enables processing/query engines to perform all processing
> on encoded data without having to convert the data (late materialization).
> We have observed dramatic performance improvements for OLAP analytic
> scenarios where tables contain many string columns. The data is converted
> back to user-readable form just before the processing/query engine returns
> results to the user.
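Illustration of the global dictionary idea above: a minimal sketch under
stated assumptions, not CarbonData's encoding. The `build_dictionary` helper
and the sample city names are hypothetical. It shows filtering and grouping on
integer codes, with strings decoded only for the final result (late
materialization).

```python
from typing import Dict, List


def build_dictionary(values: List[str]) -> Dict[str, int]:
    """Assign each distinct string an integer code (sorted for determinism)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


cities = ["Paris", "Shenzhen", "Paris", "Bangalore", "Shenzhen", "Paris"]
dictionary = build_dictionary(cities)
reverse = {code: value for value, code in dictionary.items()}

# The column is stored and processed as integer codes.
encoded = [dictionary[c] for c in cities]

# Filter on codes only: the predicate value is encoded once, up front.
paris_code = dictionary["Paris"]
paris_count = sum(1 for code in encoded if code == paris_code)

# Group-by on codes, decoding only the final result.
counts_by_code: Dict[int, int] = {}
for code in encoded:
    counts_by_code[code] = counts_by_code.get(code, 0) + 1
counts = {reverse[code]: n for code, n in counts_by_code.items()}
print(paris_count, counts)
```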
> 
> ==== Column Group ====
> 
> Sometimes users want to run processing/queries that touch multiple columns
> of one table, for example scanning for an individual record in a
> troubleshooting scenario. In this case, a row format is more efficient than
> a columnar format, since the workload touches all columns. To accelerate
> this, CarbonData supports storing a group of columns in row format, so the
> data in a column group is stored together and can be retrieved quickly.
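Illustration of the column group idea above: a minimal sketch, not CarbonData's
storage layout. The column names and values are hypothetical. It contrasts
fetching one record from a purely columnar layout (one lookup per column) with
fetching it from a group of columns stored together row by row.

```python
from typing import Dict, List, Tuple

# Columnar layout: one list per column.
columnar: Dict[str, List] = {
    "user_id": [101, 102, 103],
    "cell_id": ["A1", "B7", "C3"],
    "error_code": [0, 52, 0],
}
GROUP = ("user_id", "cell_id", "error_code")

# Column group: the same columns stored together, row by row.
column_group: List[Tuple] = list(zip(*(columnar[name] for name in GROUP)))


def fetch_record_columnar(row: int) -> Tuple:
    # One lookup per column: three separate reads for a three-column record.
    return tuple(columnar[name][row] for name in GROUP)


def fetch_record_grouped(row: int) -> Tuple:
    # A single lookup returns every column of the group.
    return column_group[row]


print(fetch_record_columnar(1), fetch_record_grouped(1))
```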
> 
> ==== Optimized for multiple use cases ====
> 
> CarbonData indices and the dictionary are highly configurable. To optimize
> storage for different use cases, users can configure what to index, and so
> can decide on and tune the format before loading data into CarbonData.
> 
> For example
> 
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> || High throughput scan || Global dictionary, Minmax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> || Individual record query || Column group, Global dictionary ||
> 
> === BigData Processing Framework Integration ===
> 
> * CarbonData provides InputFormat/OutputFormat interfaces for reading and
> writing CarbonData files, and also provides an abstract API for processing
> data stored in CarbonData format with data processing frameworks.
> * CarbonData provides deep integration with Apache Spark, including
> predicate push down, column pruning, and aggregation push down, so users can
> use Spark SQL to connect to and query CarbonData.
> * CarbonData can integrate with various big data query/processing frameworks
> in the Hadoop ecosystem, such as Apache Spark and Apache Hive.
> 
> Example: 
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> 
> == Initial Goals ==
> 
> Our initial goals are to bring CarbonData into the ASF, transition internal 
> engineering processes into the open, and foster a collaborative development 
> model according to the "Apache Way".
> 
> == Current Status ==
> 
> CarbonData is production ready and already provides a large set of features.
> The current license is already Apache 2.0.
> 
> == Meritocracy ==
> 
> We intend to radically expand the initial developer and user community by 
> running the project in accordance with the "Apache Way". Users and new 
> contributors will be treated with respect and welcomed. By participating in 
> the community and providing quality patches/support that move the project 
> forward, they will earn merit. They also will be encouraged to provide 
> non-code contributions (documentation, events, community management, etc.) 
> and will gain merit for doing so. Those with a proven support and quality 
> track record will be encouraged to become committers.
> 
> == Community ==
> 
> If CarbonData is accepted for incubation, the primary initial goal is to 
> build a large community. We really trust that CarbonData will become a key 
> project for big data column-like platforms, and so, we bet on a large 
> community of users and developers.
> 
> == Known Risks ==
> 
> Development has been sponsored mostly by one company. For the project to
> fully transition to the Apache Way governance model, development must shift
> towards the meritocracy-centric model of growing a community of
> contributors, balanced with the needs for extreme stability and core
> implementation coherency.
> 
> == Orphaned products ==
> 
> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration with
> sister ASF projects. We expect this to further reduce the risk of orphaning
> the product.
> 
> == Inexperience with Open Source ==
> 
> Huawei has been developing and using open source software for a long time.
> Additionally, several ASF veterans agreed to mentor the project and are 
> listed in this proposal. The project will rely on their guidance and 
> collective wisdom to quickly transition the entire team of initial committers 
> towards practicing the Apache Way.
> 
> == Reliance on Salaried Developers ==
> 
> Most of the contributors are paid to work in the big data space. While they
> might wander from their current employers, they are unlikely to venture far
> from their core expertise and thus will continue to be engaged with the
> project regardless of their current employers.
> 
> == An Excessive Fascination with the Apache Brand ==
> 
> While we intend to leverage the Apache ‘branding’ when talking to other
> projects as a testament to our project’s ‘neutrality’, we have no plans for
> making use of the Apache brand in press releases nor posting billboards
> advertising acceptance of CarbonData into the Apache Incubator.
> 
> == Initial Source ==
> 
> https://github.com/HuaweiBigData/carbondata.git
> 
> == External Dependencies ==
> 
> All external dependencies are licensed under the Apache 2.0 license or an
> Apache-compatible license. As we grow the CarbonData community we will
> configure our build process to require and validate that all contributions
> and dependencies are licensed under the Apache 2.0 license or an
> Apache-compatible license.
> 
> * Apache Spark
> * Apache Hadoop
> * Apache Maven
> * Apache Commons
> * Apache Log4j
> * Apache Thrift
> * Apache Zookeeper
> * Scala
> * Snappy
> * Kettle (Pentaho)
> * Eigenbase
> * Fastutil
> * GSON
> * Jmockit
> * Junit
> 
> == Required Resources ==
> 
> === Mailing lists ===
> 
> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> * comm...@carbondata.incubator.apache.org
> * d...@carbondata.incubator.apache.org
> * iss...@carbondata.incubator.apache.org
> 
> === Git Repository ===
> 
> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> 
> === Issue Tracking ===
> 
> * JIRA Project CarbonData (CarbonData)
> 
> === Initial Committers ===
> 
> * Liang Chenliang
> * Jean-Baptiste Onofré
> * Henry Saputra
> * Uma Maheswara Rao G
> * Jenny MA
> * Jacky Likun
> * Vimal Das Kammath
> * Jarray Qiuheng
> 
> === Affiliations ===
> 
> * Huawei: Liang Chenliang
> * Talend: Jean-Baptiste Onofré
> * Ebay: Henry Saputra
> * Intel: Uma Maheswara Rao G
> 
> === Sponsors ===
> 
> === Champion ===
> 
> * Jean-Baptiste Onofré - Apache Member
> 
> === Mentors ===
> 
> * Henry Saputra (eBay)
> * Jean-Baptiste Onofré (Talend)
> * Uma Maheswara Rao G (Intel)
> 
> === Sponsoring Entity ===
> 
> The Apache Incubator
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
> 

