Re: [VOTE] Accept CarbonData into the Apache Incubator

Amol Kekre Fri, 27 May 2016 07:08:38 -0700

+1 (non-binding)

Thks
Amol


On Fri, May 27, 2016 at 5:53 AM, Jim Jagielski <j...@jagunet.com> wrote:

> Thx for the feedback...
>
> I change my vote to +1 (binding)
> > On May 27, 2016, at 1:46 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> >
> > Hi Jim,
> >
> > good point. Let me try to explain this "gap" regarding my discussion
> with the team:
> >
> > 1. Some people have been involved mostly in architecture and design rather
> than directly in code. That's why they are part of the initial committer list,
> even though they didn't really provide "visible" code on github.
> >
> > 2. Some people are no longer involved in the project. That's why they
> don't appear on the initial committer list.
> >
> > Regards
> > JB
> >
> > On 05/26/2016 05:45 PM, Jim Jagielski wrote:
> >> I am trying to align the list of initial committers with
> >> the list of current/active contributors, according to
> >> Github, and I am seeing people proposed who have not
> >> contributed anything and people NOT proposed who seem
> >> to be kinda active...
> >>
> >> Sooo..... -0
> >>
> >>> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
> >>>
> >>> [ ] +1 Accept CarbonData into the Apache Incubator
> >>> [ ] +0 Abstain
> >>> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> >>>
> >>> This vote is open for 72 hours.
> >>>
> >>> The proposal follows, you can also access the wiki page:
> >>> https://wiki.apache.org/incubator/CarbonDataProposal
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> = Apache CarbonData =
> >>>
> >>> == Abstract ==
> >>>
> >>> Apache CarbonData is a new Apache Hadoop native file format for faster
> >>> interactive queries. It uses advanced columnar storage, index, compression
> >>> and encoding techniques to improve computing efficiency, which in turn
> >>> helps speed up queries by an order of magnitude over petabytes of data.
> >>>
> >>> CarbonData github address: https://github.com/HuaweiBigData/carbondata
> >>>
> >>> == Background ==
> >>>
> >>> Huawei is an ICT solution provider committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big data.
> In order to satisfy the following customer requirements, we created a new
> Hadoop native file format:
> >>>
> >>> * Support interactive OLAP-style query over big data in seconds.
> >>> * Support fast queries on individual records, which require touching all
> fields.
> >>> * Support fast data loading and incremental loads within a period of
> minutes.
> >>> * Support HDFS so that customers can leverage existing Hadoop clusters.
> >>> * Support time based data retention.
> >>>
> >>> Based on these requirements, we investigated existing file formats in
> the Hadoop eco-system, but we could not find a suitable solution that
> satisfies all the requirements at the same time, so we started designing
> CarbonData.
> >>>
> >>> == Rationale ==
> >>>
> >>> CarbonData contains multiple modules, which are classified into two
> categories:
> >>>
> >>> 1. CarbonData File Format: contains the core implementation of the file
> format, such as columnar storage, index, dictionary, encoding and compression,
> and the API for reading/writing.
> >>> 2. CarbonData integration with big data processing frameworks such as
> Apache Spark, Apache Hive etc. Apache Beam integration is also planned, to
> abstract the execution runtime.
> >>>
> >>> === CarbonData File Format ===
> >>>
> >>> CarbonData file format is a columnar store in HDFS. It has many
> features that a modern columnar format has, such as splittability, compression
> schema, complex data types etc. CarbonData also has the following unique features:
> >>>
> >>> ==== Indexing ====
> >>>
> >>> In order to support fast interactive queries, CarbonData leverages
> indexing technology to reduce I/O scans. CarbonData files store data along
> with the index: the index is not stored separately, but the CarbonData file
> itself contains the index. In the current implementation, CarbonData supports 3
> types of indexing:
> >>>
> >>> 1. Multi-dimensional Key (B+ Tree index)
> >>> The data blocks are written in sequence to the disk, and within each
> data block each column block is written in sequence. Finally, the metadata
> block for the file is written with information about the byte positions of each
> block in the file, the Min-Max statistics index and the start and end MDK of
> each data block. Since the entire data in the file is in sorted order, the
> start and end MDK of each data block can be used to construct a B+Tree, and
> the file can be logically represented as a B+Tree with the data blocks as
> leaf nodes (on disk) and the remaining non-leaf nodes in memory.
> >>> 2. Inverted index
> >>> Inverted indexes are widely used in search engines. This index helps
> the processing/query engine do filtering inside one HDFS block.
> Furthermore, query acceleration for count-distinct-like operations is made
> possible by combining bitmap and inverted index at query time.
> >>> 3. MinMax index
> >>> For all columns, a minmax index is created so that the processing/query
> engine can skip scans that are not required.
> >>>
> >>> ==== Global Dictionary ====
> >>>
> >>> Besides I/O reduction, CarbonData accelerates computation by using a
> global dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvement for
> OLAP analytic scenarios where a table contains many columns of string data
> type. The data is converted back to user-readable form just before the
> processing/query engine returns results to the user.
> >>>
> >>> ==== Column Group ====
> >>>
> >>> Sometimes users want to perform processing/queries on multiple columns in
> one table, for example, scanning for an individual record in a
> troubleshooting scenario. In this case, row format is more efficient than
> columnar format since all columns will be touched by the workload. To
> accelerate this, CarbonData supports storing a group of columns in row
> format, so data in a column group is stored together, enabling fast
> retrieval.
> >>>
> >>> ==== Optimized for multiple use cases ====
> >>>
> >>> CarbonData indices and dictionary are highly configurable. To make
> storage optimized for different use cases, users can configure what to
> index, so they can decide on and tune the format before loading data into
> CarbonData.
> >>>
> >>> For example
> >>>
> >>> || Use Case || Supporting Features ||
> >>> || Interactive OLAP query || Columnar format, Multi-dimensional Key
> (B+ Tree index), Minmax index, Inverted index ||
> >>> || High throughput scan || Global dictionary, Minmax index ||
> >>> || Low latency point query || Multi-dimensional Key (B+ Tree index),
> Partitioning ||
> >>> || Individual record query || Column group, Global dictionary ||
> >>>
> >>> === BigData Processing Framework Integration ===
> >>>
> >>> * CarbonData provides InputFormat/OutputFormat interfaces for
> reading/writing data from CarbonData files, and at the same time
> provides an abstract API for processing data stored in CarbonData format with
> data processing frameworks.
> >>> * CarbonData provides deep integration with Apache Spark including
> predicate push down, column pruning, aggregation push down etc. So users
> can use Spark SQL to connect to and query CarbonData.
> >>> * CarbonData can integrate with various big data query/processing
> frameworks in the Hadoop eco-system such as Apache Spark, Apache Hive etc.
> >>>
> >>> Example:
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
> >>>
> >>> == Initial Goals ==
> >>>
> >>> Our initial goals are to bring CarbonData into the ASF, transition
> internal engineering processes into the open, and foster a collaborative
> development model according to the "Apache Way".
> >>>
> >>> == Current Status ==
> >>>
> >>> CarbonData is production ready and already provides a large set of
> features.
> >>> The current license is already Apache 2.0.
> >>>
> >>> == Meritocracy ==
> >>>
> >>> We intend to radically expand the initial developer and user community
> by running the project in accordance with the "Apache Way". Users and new
> contributors will be treated with respect and welcomed. By participating in
> the community and providing quality patches/support that move the project
> forward, they will earn merit. They also will be encouraged to provide
> non-code contributions (documentation, events, community management, etc.)
> and will gain merit for doing so. Those with a proven support and quality
> track record will be encouraged to become committers.
> >>>
> >>> == Community ==
> >>>
> >>> If CarbonData is accepted for incubation, the primary initial goal is
> to build a large community. We really trust that CarbonData will become a
> key project for big data column-like platforms, and so, we bet on a large
> community of users and developers.
> >>>
> >>> == Known Risks ==
> >>>
> >>> Development has been sponsored mostly by one company. For the project
> to fully transition to the Apache Way governance model, development must
> shift towards the meritocracy-centric model of growing a community of
> contributors balanced with the needs for extreme stability and core
> implementation coherency.
> >>>
> >>> == Orphaned products ==
> >>>
> >>> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration with
> sister ASF projects. We expect this to further reduce the risk of
> orphaning the product.
> >>>
> >>> == Inexperience with Open Source ==
> >>>
> >>> Huawei has been developing and using open source software for a long
> time. Additionally, several ASF veterans agreed to mentor the project and
> are listed in this proposal. The project will rely on their guidance and
> collective wisdom to quickly transition the entire team of initial
> committers towards practicing the Apache Way.
> >>>
> >>> == Reliance on Salaried Developers ==
> >>>
> >>> Most of the contributors are paid to work in the big data space. While
> they might wander from their current employers, they are unlikely to
> venture far from their core expertise and thus will continue to be engaged
> with the project regardless of their current employers.
> >>>
> >>> == An Excessive Fascination with the Apache Brand ==
> >>>
> >>> While we intend to leverage the Apache ‘branding’ when talking to
> other projects as testament of our project’s ‘neutrality’, we have no plans
> for making use of the Apache brand in press releases nor posting billboards
> advertising acceptance of CarbonData into the Apache Incubator.
> >>>
> >>> == Initial Source ==
> >>>
> >>> https://github.com/HuaweiBigData/carbondata.git
> >>>
> >>> == External Dependencies ==
> >>>
> >>> All external dependencies are licensed under an Apache 2.0 license or
> >>> Apache-compatible license. As we grow the CarbonData community we will
> >>> configure our build process to require and validate that all contributions
> >>> and dependencies are licensed under the Apache 2.0 license or are under
> >>> an Apache-compatible license.
> >>>
> >>> * Apache Spark
> >>> * Apache Hadoop
> >>> * Apache Maven
> >>> * Apache Commons
> >>> * Apache Log4j
> >>> * Apache Thrift
> >>> * Apache Zookeeper
> >>> * Scala
> >>> * Snappy
> >>> * Kettle (Pentaho)
> >>> * Eigenbase
> >>> * Fastutil
> >>> * GSON
> >>> * Jmockit
> >>> * Junit
> >>>
> >>> == Required Resources ==
> >>>
> >>> === Mailing lists ===
> >>>
> >>> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> >>> * comm...@carbondata.incubator.apache.org
> >>> * d...@carbondata.incubator.apache.org
> >>> * iss...@carbondata.incubator.apache.org
> >>>
> >>> === Git Repository ===
> >>>
> >>> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
> >>>
> >>> === Issue Tracking ===
> >>>
> >>> * JIRA Project CarbonData (CarbonData)
> >>>
> >>> === Initial Committers ===
> >>>
> >>> * Liang Chenliang
> >>> * Jean-Baptiste Onofré
> >>> * Henry Saputra
> >>> * Uma Maheswara Rao G
> >>> * Jenny MA
> >>> * Jacky Likun
> >>> * Vimal Das Kammath
> >>> * Jarray Qiuheng
> >>>
> >>> === Affiliations ===
> >>>
> >>> * Huawei: Liang Chenliang
> >>> * Talend: Jean-Baptiste Onofré
> >>> * Ebay: Henry Saputra
> >>> * Intel: Uma Maheswara Rao G
> >>>
> >>> === Sponsors ===
> >>>
> >>> === Champion ===
> >>>
> >>> * Jean-Baptiste Onofré - Apache Member
> >>>
> >>> === Mentors ===
> >>>
> >>> * Henry Saputra (eBay)
> >>> * Jean-Baptiste Onofré (Talend)
> >>> * Uma Maheswara Rao G (Intel)
> >>>
> >>> === Sponsoring Entity ===
> >>>
> >>> The Apache Incubator
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >>> For additional commands, e-mail: general-h...@incubator.apache.org
> >>>
> >>
> >>
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
> >
>
>
>
>
