+1 (non-binding)

Thanks,
Amol
On Fri, May 27, 2016 at 5:53 AM, Jim Jagielski <j...@jagunet.com> wrote:

> Thx for the feedback...
>
> I change my vote to +1 (binding)
>
> On May 27, 2016, at 1:46 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> > Hi Jim,
> >
> > Good point. Let me try to explain this "gap", based on my discussion with the team:
> >
> > 1. Some people have been involved mostly in architecture and design rather than directly in code. That's why they are part of the initial committer list, even though they didn't really provide "visible" code on GitHub.
> >
> > 2. Some people are no longer involved in the project. That's why they don't appear on the initial committer list.
> >
> > Regards
> > JB
> >
> > On 05/26/2016 05:45 PM, Jim Jagielski wrote:
> >> I am trying to align the list of initial committers with
> >> the list of current/active contributors, according to
> >> GitHub, and I am seeing people proposed who have not
> >> contributed anything and people NOT proposed who seem
> >> to be kinda active...
> >>
> >> Sooo..... -0
> >>
> >>> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Following the discussion thread, I'm now calling a vote to accept CarbonData into the Incubator.
> >>>
> >>> [ ] +1 Accept CarbonData into the Apache Incubator
> >>> [ ] +0 Abstain
> >>> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> >>>
> >>> This vote is open for 72 hours.
> >>>
> >>> The proposal follows; you can also access the wiki page:
> >>> https://wiki.apache.org/incubator/CarbonDataProposal
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> = Apache CarbonData =
> >>>
> >>> == Abstract ==
> >>>
> >>> Apache CarbonData is a new Apache Hadoop native file format for faster interactive queries. It uses advanced columnar storage, indexing, compression and encoding techniques to improve computing efficiency, which in turn helps speed up queries by an order of magnitude over petabytes of data.
> >>>
> >>> CarbonData github address: https://github.com/HuaweiBigData/carbondata
> >>>
> >>> == Background ==
> >>>
> >>> Huawei is an ICT solution provider committed to enhancing customer experiences for telecom carriers, enterprises, and consumers on big data. To satisfy the following customer requirements, we created a new Hadoop native file format:
> >>>
> >>> * Support interactive OLAP-style queries over big data in seconds.
> >>> * Support fast queries on individual records, which require touching all fields.
> >>> * Fast data loading, with support for incremental loads at intervals of minutes.
> >>> * Support HDFS so that customers can leverage their existing Hadoop clusters.
> >>> * Support time-based data retention.
> >>>
> >>> Based on these requirements, we investigated existing file formats in the Hadoop ecosystem, but we could not find a suitable solution that satisfies all of the requirements at the same time, so we started designing CarbonData.
> >>>
> >>> == Rationale ==
> >>>
> >>> CarbonData contains multiple modules, which are classified into two categories:
> >>>
> >>> 1. CarbonData File Format: contains the core implementation of the file format, such as columnar storage, index, dictionary, encoding and compression, and the API for reading/writing.
> >>> 2. CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam is also planned, to abstract the execution runtime.
> >>>
> >>> === CarbonData File Format ===
> >>>
> >>> The CarbonData file format is a columnar store in HDFS. It has many of the features of a modern columnar format, such as splittability, compression schemes, and complex data types. In addition, CarbonData has the following unique features:
> >>>
> >>> ==== Indexing ====
> >>>
> >>> In order to support fast interactive queries, CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, but the CarbonData file itself contains the index. In the current implementation, CarbonData supports 3 types of indexing:
> >>>
> >>> 1. Multi-dimensional Key (B+ Tree index)
> >>> The data blocks are written in sequence to the disk, and within each data block each column block is written in sequence. Finally, the metadata block for the file is written with information about the byte positions of each block in the file, the Min-Max statistics index, and the start and end multi-dimensional key (MDK) of each data block. Since the entire data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree, and the file can be logically represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
> >>> 2. Inverted index
> >>> Inverted indexes are widely used in search engines. Using this index helps the processing/query engine do filtering inside one HDFS block. Furthermore, query acceleration for count-distinct-like operations is made possible by combining bitmap and inverted index at query time.
> >>> 3. MinMax index
> >>> For all columns, a min-max index is created so that the processing/query engine can skip scans that are not required.
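As an aside on the MinMax index item above: the skipping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CarbonData's actual metadata layout or API; `BlockStats`, `blocks_to_scan` and the sample values are hypothetical names and data made up for this note.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockStats:
    """Hypothetical per-block metadata: min and max value of one column."""
    block_id: int
    min_value: int
    max_value: int


def blocks_to_scan(stats: List[BlockStats], lo: int, hi: int) -> List[int]:
    """Return ids of blocks whose [min, max] range overlaps the filter range [lo, hi].

    Blocks that cannot contain a matching value are skipped without being read.
    """
    return [s.block_id for s in stats if s.max_value >= lo and s.min_value <= hi]


# Made-up statistics for three data blocks of a sorted column.
stats = [
    BlockStats(0, 1, 100),
    BlockStats(1, 101, 200),
    BlockStats(2, 201, 300),
]

# A filter such as "150 <= value <= 180" only needs block 1.
selected = blocks_to_scan(stats, 150, 180)
print(selected)
```

Only the small statistics list is consulted to decide which blocks to read, which is where the I/O saving would come from in a scheme like this.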
> >>>
> >>> ==== Global Dictionary ====
> >>>
> >>> Besides I/O reduction, CarbonData accelerates computation by using a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (late materialization). We have observed dramatic performance improvements for OLAP analytic scenarios where tables contain many columns of string data type. The data is converted back to user-readable form just before the processing/query engine returns results to the user.
> >>>
> >>> ==== Column Group ====
> >>>
> >>> Sometimes users want to perform processing/queries on multiple columns in one table, for example, scanning for an individual record in a troubleshooting scenario. In this case, row format is more efficient than columnar format since all columns will be touched by the workload. To accelerate this, CarbonData supports storing a group of columns in row format, so data in a column group is stored together, enabling fast retrieval.
> >>>
> >>> ==== Optimized for multiple use cases ====
> >>>
> >>> CarbonData indices and dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, so they can decide on and tune the format before loading data into CarbonData.
> >>>
> >>> For example
> >>>
> >>> || Use Case || Supporting Features ||
> >>> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> >>> || High throughput scan || Global dictionary, Minmax index ||
> >>> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> >>> || Individual record query || Column group, Global dictionary ||
> >>>
> >>> === BigData Processing Framework Integration ===
> >>>
> >>> * CarbonData provides InputFormat/OutputFormat interfaces for reading/writing data from CarbonData files, and at the same time provides an abstract API for processing data stored in CarbonData format with data processing frameworks.
> >>> * CarbonData provides deep integration with Apache Spark, including predicate push down, column pruning, aggregation push down, etc., so users can use Spark SQL to connect to and query CarbonData.
> >>> * CarbonData can integrate with various big data query/processing frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.
> >>>
> >>> Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> >>>
> >>> == Initial Goals ==
> >>>
> >>> Our initial goals are to bring CarbonData into the ASF, transition internal engineering processes into the open, and foster a collaborative development model according to the "Apache Way".
> >>>
> >>> == Current Status ==
> >>>
> >>> CarbonData is production ready and already provides a large set of features.
> >>> The current license is already Apache 2.0.
> >>>
> >>> == Meritocracy ==
> >>>
> >>> We intend to radically expand the initial developer and user community by running the project in accordance with the "Apache Way". Users and new contributors will be treated with respect and welcomed.
> >>> By participating in the community and providing quality patches/support that move the project forward, they will earn merit. They will also be encouraged to provide non-code contributions (documentation, events, community management, etc.) and will gain merit for doing so. Those with a proven support and quality track record will be encouraged to become committers.
> >>>
> >>> == Community ==
> >>>
> >>> If CarbonData is accepted for incubation, the primary initial goal is to build a large community. We really trust that CarbonData will become a key project for big data column-like platforms, and so we bet on a large community of users and developers.
> >>>
> >>> == Known Risks ==
> >>>
> >>> Development has been sponsored mostly by one company. For the project to fully transition to the Apache Way governance model, development must shift towards the meritocracy-centric model of growing a community of contributors, balanced with the needs for extreme stability and core implementation coherency.
> >>>
> >>> == Orphaned products ==
> >>>
> >>> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested interest in making CarbonData succeed by driving its close integration with sister ASF projects. We expect this to further reduce the risk of orphaning the product.
> >>>
> >>> == Inexperience with Open Source ==
> >>>
> >>> Huawei has been developing and using open source software for a long time. Additionally, several ASF veterans agreed to mentor the project and are listed in this proposal. The project will rely on their guidance and collective wisdom to quickly transition the entire team of initial committers towards practicing the Apache Way.
> >>>
> >>> == Reliance on Salaried Developers ==
> >>>
> >>> Most of the contributors are paid to work in the big data space.
> >>> While they might wander from their current employers, they are unlikely to venture far from their core expertise and thus will continue to be engaged with the project regardless of their current employers.
> >>>
> >>> == An Excessive Fascination with the Apache Brand ==
> >>>
> >>> While we intend to leverage the Apache ‘branding’ when talking to other projects as a testament to our project’s ‘neutrality’, we have no plans for making use of the Apache brand in press releases nor posting billboards advertising acceptance of CarbonData into the Apache Incubator.
> >>>
> >>> == Initial Source ==
> >>>
> >>> https://github.com/HuaweiBigData/carbondata.git
> >>>
> >>> == External Dependencies ==
> >>>
> >>> All external dependencies are licensed under an Apache 2.0 license or
> >>> Apache-compatible license. As we grow the CarbonData community we will
> >>> configure our build process to require and validate that all contributions
> >>> and dependencies are licensed under the Apache 2.0 license or are under
> >>> an Apache-compatible license.
> >>>
> >>> * Apache Spark
> >>> * Apache Hadoop
> >>> * Apache Maven
> >>> * Apache Commons
> >>> * Apache Log4j
> >>> * Apache Thrift
> >>> * Apache Zookeeper
> >>> * Scala
> >>> * Snappy
> >>> * Kettle (Pentaho)
> >>> * Eigenbase
> >>> * Fastutil
> >>> * GSON
> >>> * Jmockit
> >>> * Junit
> >>>
> >>> == Required Resources ==
> >>>
> >>> === Mailing lists ===
> >>>
> >>> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> >>> * comm...@carbondata.incubator.apache.org
> >>> * d...@carbondata.incubator.apache.org
> >>> * iss...@carbondata.incubator.apache.org
> >>>
> >>> === Git Repository ===
> >>>
> >>> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> >>>
> >>> === Issue Tracking ===
> >>>
> >>> * JIRA Project CarbonData (CarbonData)
> >>>
> >>> === Initial Committers ===
> >>>
> >>> * Liang Chenliang
> >>> * Jean-Baptiste Onofré
> >>> * Henry Saputra
> >>> * Uma Maheswara Rao G
> >>> * Jenny MA
> >>> * Jacky Likun
> >>> * Vimal Das Kammath
> >>> * Jarray Qiuheng
> >>>
> >>> === Affiliations ===
> >>>
> >>> * Huawei: Liang Chenliang
> >>> * Talend: Jean-Baptiste Onofré
> >>> * eBay: Henry Saputra
> >>> * Intel: Uma Maheswara Rao G
> >>>
> >>> === Sponsors ===
> >>>
> >>> === Champion ===
> >>>
> >>> * Jean-Baptiste Onofré - Apache Member
> >>>
> >>> === Mentors ===
> >>>
> >>> * Henry Saputra (eBay)
> >>> * Jean-Baptiste Onofré (Talend)
> >>> * Uma Maheswara Rao G (Intel)
> >>>
> >>> === Sponsoring Entity ===
> >>>
> >>> The Apache Incubator
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >>> For additional commands, e-mail: general-h...@incubator.apache.org
> >>>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com