+1

Thanks,
Madhawa

On Fri, May 27, 2016 at 11:16 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Jim,
>
> good point. Let me try to explain this "gap" based on my discussion with
> the team:
>
> 1. Some people have been involved mostly in architecture and design rather
> than directly in code. That's why they are part of the initial committer
> list, even though they didn't really provide "visible" code on GitHub.
>
> 2. Some people are no longer involved in the project. That's why they don't
> appear on the initial committer list.
>
> Regards
> JB
>
> On 05/26/2016 05:45 PM, Jim Jagielski wrote:
>
>> I am trying to align the list of initial committers with
>> the list of current/active contributors, according to
>> Github, and I am seeing people proposed who have not
>> contributed anything and people NOT proposed who seem
>> to be kinda active...
>>
>> Sooo..... -0
>>
>> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> following the discussion thread, I'm now calling a vote to accept
>>> CarbonData into the Incubator.
>>>
>>> [ ] +1 Accept CarbonData into the Apache Incubator
>>> [ ] +0 Abstain
>>> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>>>
>>> This vote is open for 72 hours.
>>>
>>> The proposal follows, you can also access the wiki page:
>>> https://wiki.apache.org/incubator/CarbonDataProposal
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> = Apache CarbonData =
>>>
>>> == Abstract ==
>>>
>>> Apache CarbonData is a new Apache Hadoop native file format for faster
>>> interactive queries, using advanced columnar storage, indexing,
>>> compression and encoding techniques to improve computing efficiency.
>>> In turn, it will help speed up queries by an order of magnitude over
>>> petabytes of data.
>>>
>>> CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata
>>>
>>> == Background ==
>>>
>>> Huawei is an ICT solution provider. We are committed to enhancing
>>> customer experiences for telecom carriers, enterprises, and consumers on
>>> big data. In order to satisfy the following customer requirements, we
>>> created a new Hadoop native file format:
>>>
>>> * Support interactive OLAP-style queries over big data in seconds.
>>> * Support fast queries on individual records, which require touching all
>>> fields.
>>> * Fast data loading and support for incremental loads at intervals of
>>> minutes.
>>> * Support HDFS so that customers can leverage existing Hadoop clusters.
>>> * Support time-based data retention.
>>>
>>> Based on these requirements, we investigated existing file formats in
>>> the Hadoop ecosystem, but we could not find a suitable solution that
>>> satisfies all of the requirements at the same time, so we started
>>> designing CarbonData.
>>>
>>> == Rationale ==
>>>
>>> CarbonData contains multiple modules, which are classified into two
>>> categories:
>>>
>>> 1. CarbonData File Format: contains the core implementation of the file
>>> format, such as columnar storage, index, dictionary, encoding and
>>> compression, API for reading/writing, etc.
>>> 2. CarbonData integration with big data processing frameworks such as
>>> Apache Spark, Apache Hive, etc. Apache Beam is also planned to abstract
>>> the execution runtime.
>>>
>>> === CarbonData File Format ===
>>>
>>> The CarbonData file format is a columnar store in HDFS. It has many
>>> features that a modern columnar format has, such as being splittable,
>>> compression schema, complex data types, etc. CarbonData also has the
>>> following unique features:
>>>
>>> ==== Indexing ====
>>>
>>> In order to support fast interactive queries, CarbonData leverages
>>> indexing technology to reduce I/O scans. CarbonData files store data
>>> along with the index: the index is not stored separately; the CarbonData
>>> file itself contains the index.
>>> In the current implementation, CarbonData supports 3 types of indexing:
>>>
>>> 1. Multi-dimensional Key (B+ Tree index)
>>> The data blocks are written in sequence to the disk, and within each
>>> data block each column block is written in sequence. Finally, the
>>> metadata block for the file is written with information about the byte
>>> positions of each block in the file, the Min-Max statistics index and
>>> the start and end MDK of each data block. Since the entire data in the
>>> file is in sorted order, the start and end MDK of each data block can be
>>> used to construct a B+Tree, and the file can be logically represented as
>>> a B+Tree with the data blocks as leaf nodes (on disk) and the remaining
>>> non-leaf nodes in memory.
>>> 2. Inverted index
>>> Inverted indexes are widely used in search engines. This index helps the
>>> processing/query engine to do filtering inside one HDFS block.
>>> Furthermore, query acceleration for count-distinct-like operations is
>>> made possible by combining bitmap and inverted index at query time.
>>> 3. MinMax index
>>> For all columns, a minmax index is created so that the processing/query
>>> engine can skip scans that are not required.
>>>
>>> ==== Global Dictionary ====
>>>
>>> Besides I/O reduction, CarbonData accelerates computation by using a
>>> global dictionary, which enables processing/query engines to perform all
>>> processing on encoded data without having to convert the data (Late
>>> Materialization). We have observed dramatic performance improvement for
>>> OLAP analytic scenarios where a table contains many columns of string
>>> data type. The data is converted back to user-readable form just before
>>> the processing/query engine returns results to the user.
>>>
>>> ==== Column Group ====
>>>
>>> Sometimes users want to perform processing/query on multiple columns in
>>> one table, for example, performing a scan for an individual record in a
>>> troubleshooting scenario.
>>> In this case, row format is more efficient than columnar format since
>>> all columns will be touched by the workload. To accelerate this,
>>> CarbonData supports storing a group of columns in row format, so data in
>>> a column group is stored together, enabling fast retrieval.
>>>
>>> ==== Optimized for multiple use cases ====
>>>
>>> CarbonData indices and dictionary are highly configurable. To optimize
>>> storage for different use cases, users can configure what to index, so
>>> they can decide on and tune the format before loading data into
>>> CarbonData.
>>>
>>> For example:
>>>
>>> || Use Case || Supporting Features ||
>>> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
>>> || High throughput scan || Global dictionary, Minmax index ||
>>> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
>>> || Individual record query || Column group, Global dictionary ||
>>>
>>> === BigData Processing Framework Integration ===
>>>
>>> * CarbonData provides InputFormat/OutputFormat interfaces for
>>> reading/writing data from CarbonData files and at the same time
>>> provides an abstract API for processing data stored in CarbonData format
>>> with data processing frameworks.
>>> * CarbonData provides deep integration with Apache Spark including
>>> predicate push down, column pruning, aggregation push down, etc., so
>>> users can use Spark SQL to connect to and query CarbonData.
>>> * CarbonData can integrate with various big data query/processing
>>> frameworks in the Hadoop ecosystem such as Apache Spark, Apache Hive,
>>> etc.
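
To make the interplay between predicate push down and the MinMax index a bit more concrete, here is a minimal sketch in Python. It is not CarbonData's implementation or API: the `Block` class and the `scan_equals` function are hypothetical names invented for illustration, and the sketch only assumes that each block keeps min/max statistics for a column and that the engine receives an equality predicate.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block: one column's values plus min/max statistics."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # Statistics are computed once, when the block is "written".
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Evaluate the pushed-down predicate `column == target`.

    Returns the matching values and the number of blocks actually scanned.
    """
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # Min/max check: skip the block when the target cannot be inside it.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.values if v == target)
    return matches, scanned


blocks = [Block([1, 3, 5]), Block([10, 12, 12]), Block([20, 25, 30])]
print(scan_equals(blocks, 12))  # ([12, 12], 1)
```

Only the second block is read for the value 12, because the statistics of the other two blocks rule them out. A block whose range contains the target is still scanned even if no value matches, so min/max statistics reduce I/O but do not replace the filter itself.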
>>>
>>> Example:
>>> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>>>
>>> == Initial Goals ==
>>>
>>> Our initial goals are to bring CarbonData into the ASF, transition
>>> internal engineering processes into the open, and foster a collaborative
>>> development model according to the "Apache Way".
>>>
>>> == Current Status ==
>>>
>>> CarbonData is production ready and already provides a large set of
>>> features.
>>> The current license is already Apache 2.0.
>>>
>>> == Meritocracy ==
>>>
>>> We intend to radically expand the initial developer and user community
>>> by running the project in accordance with the "Apache Way". Users and new
>>> contributors will be treated with respect and welcomed. By participating
>>> in the community and providing quality patches/support that move the
>>> project forward, they will earn merit. They will also be encouraged to
>>> provide non-code contributions (documentation, events, community
>>> management, etc.) and will gain merit for doing so. Those with a proven
>>> support and quality track record will be encouraged to become committers.
>>>
>>> == Community ==
>>>
>>> If CarbonData is accepted for incubation, the primary initial goal is to
>>> build a large community. We really trust that CarbonData will become a
>>> key project for big data column-like platforms, and so we bet on a large
>>> community of users and developers.
>>>
>>> == Known Risks ==
>>>
>>> Development has been sponsored mostly by one company. For the project
>>> to fully transition to the Apache Way governance model, development must
>>> shift towards the meritocracy-centric model of growing a community of
>>> contributors balanced with the needs for extreme stability and core
>>> implementation coherency.
>>>
>>> == Orphaned products ==
>>>
>>> Huawei is fully committed to CarbonData.
>>> Moreover, Huawei has a vested interest in making CarbonData succeed by
>>> driving its close integration with sister ASF projects. We expect this
>>> to further reduce the risk of orphaning the product.
>>>
>>> == Inexperience with Open Source ==
>>>
>>> Huawei has been developing and using open source software for a long
>>> time. Additionally, several ASF veterans have agreed to mentor the
>>> project and are listed in this proposal. The project will rely on their
>>> guidance and collective wisdom to quickly transition the entire team of
>>> initial committers towards practicing the Apache Way.
>>>
>>> == Reliance on Salaried Developers ==
>>>
>>> Most of the contributors are paid to work in the big data space. While
>>> they might wander from their current employers, they are unlikely to
>>> venture far from their core expertise and thus will continue to be
>>> engaged with the project regardless of their current employers.
>>>
>>> == An Excessive Fascination with the Apache Brand ==
>>>
>>> While we intend to leverage the Apache ‘branding’ when talking to other
>>> projects as testament of our project’s ‘neutrality’, we have no plans for
>>> making use of the Apache brand in press releases nor for posting
>>> billboards advertising acceptance of CarbonData into the Apache
>>> Incubator.
>>>
>>> == Initial Source ==
>>>
>>> https://github.com/HuaweiBigData/carbondata.git
>>>
>>> == External Dependencies ==
>>>
>>> All external dependencies are licensed under an Apache 2.0 license or
>>> Apache-compatible license. As we grow the CarbonData community we will
>>> configure our build process to require and validate that all
>>> contributions and dependencies are licensed under the Apache 2.0 license
>>> or are under an Apache-compatible license.
>>>
>>> * Apache Spark
>>> * Apache Hadoop
>>> * Apache Maven
>>> * Apache Commons
>>> * Apache Log4j
>>> * Apache Thrift
>>> * Apache Zookeeper
>>> * Scala
>>> * Snappy
>>> * Kettle (Pentaho)
>>> * Eigenbase
>>> * Fastutil
>>> * GSON
>>> * Jmockit
>>> * Junit
>>>
>>> == Required Resources ==
>>>
>>> === Mailing lists ===
>>>
>>> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
>>> * comm...@carbondata.incubator.apache.org
>>> * d...@carbondata.incubator.apache.org
>>> * iss...@carbondata.incubator.apache.org
>>>
>>> === Git Repository ===
>>>
>>> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>>>
>>> === Issue Tracking ===
>>>
>>> * JIRA Project CarbonData (CarbonData)
>>>
>>> === Initial Committers ===
>>>
>>> * Liang Chenliang
>>> * Jean-Baptiste Onofré
>>> * Henry Saputra
>>> * Uma Maheswara Rao G
>>> * Jenny MA
>>> * Jacky Likun
>>> * Vimal Das Kammath
>>> * Jarray Qiuheng
>>>
>>> === Affiliations ===
>>>
>>> * Huawei: Liang Chenliang
>>> * Talend: Jean-Baptiste Onofré
>>> * eBay: Henry Saputra
>>> * Intel: Uma Maheswara Rao G
>>>
>>> === Sponsors ===
>>>
>>> === Champion ===
>>>
>>> * Jean-Baptiste Onofré - Apache Member
>>>
>>> === Mentors ===
>>>
>>> * Henry Saputra (eBay)
>>> * Jean-Baptiste Onofré (Talend)
>>> * Uma Maheswara Rao G (Intel)
>>>
>>> === Sponsoring Entity ===
>>>
>>> The Apache Incubator
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>> For additional commands, e-mail: general-h...@incubator.apache.org
>>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>