Re: [PROPOSAL] Apache AsterixDB Incubator

Till Westmann Wed, 21 Jan 2015 00:38:15 -0800

Thanks!

Till


> On Jan 20, 2015, at 11:06, Ted Dunning <[email protected]> wrote:
> 
> 
> Added my name to the mentor list.
> 
> 
> 
>> On Tue, Jan 20, 2015 at 8:37 AM, Mike Carey <[email protected]> wrote:
>> Wonderful; thanks, Ted!!
>> Cheers,
>> Mike
>> 
>>> On 1/19/15 11:29 PM, Ted Dunning wrote:
>>> 
>>> Chris just asked me under separate cover. 
>>> 
>>> I am happy to help out as mentor.
>>> 
>>> 
>>> 
>>> On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra <[email protected]> 
>>> wrote:
>>>> Thanks Till,
>>>> 
>>>> Will try to solicit more mentors to help.
>>>> Especially since most of the initial committers have not been exposed to
>>>> contributing the Apache way.
>>>> 
>>>> - Henry
>>>> 
>>>> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <[email protected]> wrote:
>>>> > Hi Henry,
>>>> >
>>>> > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
>>>> >
>>>> > Even if your time is very limited we would be very happy to have you on 
>>>> > board as a mentor.
>>>> > I’ll add you to the proposal.
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> >> On Jan 19, 2015, at 10:26 AM, Henry Saputra <[email protected]> 
>>>> >> wrote:
>>>> >>
>>>> >> +1 This is GREAT News!
>>>> >>
>>>> >> Was watching and trying AsterixDB last year and it looked in awesome shape.
>>>> >>
>>>> >> I have my plate full but would love to help mentor this project to get
>>>> >> it going to ASF if needed!
>>>> >>
>>>> >> - Henry
>>>> >>
>>>> >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
>>>> >> <[email protected]> wrote:
>>>> >>> Hi Folks,
>>>> >>>
>>>> >>> I am pleased to bring forth the Apache AsterixDB proposal to the
>>>> >>> Apache Incubator as Champion, working in collaboration with the
>>>> >>> team. Please find the wiki proposal here:
>>>> >>>
>>>> >>> https://wiki.apache.org/incubator/AsterixDBProposal
>>>> >>>
>>>> >>>
>>>> >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
>>>> >>> leave the discussion open for a week, and then look to call a VOTE
>>>> >>> hopefully end of next week if all is well.
>>>> >>>
>>>> >>> Cheers!
>>>> >>> Chris Mattmann
>>>> >>>
>>>> >>> =============================================================
>>>> >>> Apache AsterixDB Proposal
>>>> >>>
>>>> >>> Abstract
>>>> >>>
>>>> >>> Apache AsterixDB is a scalable big data management system (BDMS) that
>>>> >>> provides storage, management, and query capabilities for large
>>>> >>> collections of semi-structured data.
>>>> >>>
>>>> >>> Proposal
>>>> >>>
>>>> >>> AsterixDB is a big data management system (BDMS) with a feature set
>>>> >>> that makes it well-suited to needs such as web data warehousing and
>>>> >>> social data storage and analysis. Feature-wise, AsterixDB has:
>>>> >>>
>>>> >>> * A NoSQL style data model (ADM) based on extending JSON with object
>>>> >>>  database concepts.
>>>> >>> * An expressive and declarative query language (AQL) for querying
>>>> >>>  semi-structured data.
>>>> >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
>>>> >>>  execution of query plans.
>>>> >>> * Partitioned LSM-based data storage and indexing for efficient
>>>> >>>  ingestion of newly arriving data.
>>>> >>> * Support for querying and indexing external data (e.g., in HDFS) as
>>>> >>>  well as data stored within AsterixDB.
>>>> >>> * A rich set of primitive data types, including support for spatial,
>>>> >>>  temporal, and textual data.
>>>> >>> * Indexing options that include B+ trees, R trees, and inverted
>>>> >>>  keyword index support.
>>>> >>> * Basic transactional (concurrency and recovery) capabilities akin to
>>>> >>>  those of a NoSQL store.
>>>> >>>
>>>> >>>
>>>> >>> Background and Rationale
>>>> >>>
>>>> >>> In the world of relational databases, the need to tackle data volumes
>>>> >>> that exceed the capabilities of a single server led to the
>>>> >>> development of “shared-nothing” parallel database systems several
>>>> >>> decades ago. These systems spread data over a cluster based on a
>>>> >>> partitioning strategy, such as hash partitioning, and queries are
>>>> >>> processed by employing partitioned-parallel divide-and-conquer
>>>> >>> techniques. Since these systems are fronted by a high-level,
>>>> >>> declarative language (SQL), their users are shielded from the
>>>> >>> complexities of parallel programming. Parallel database systems have
>>>> >>> been an extremely successful application of parallel computing, and
>>>> >>> quite a number of commercial products exist today.
>>>> >>>
>>>> >>> In the distributed systems world, the Web brought a need to index and
>>>> >>> query its huge content. SQL and relational databases were not the
>>>> >>> answer, though shared-nothing clusters again emerged as the hardware
>>>> >>> platform of choice. Google developed the Google File System (GFS) and
>>>> >>> MapReduce programming model to allow programmers to store and process
>>>> >>> Big Data by writing a few user-defined functions. The MapReduce
>>>> >>> framework applies these functions in parallel to data instances in
>>>> >>> distributed files (map) and to sorted groups of instances sharing a
>>>> >>> common key (reduce) -- not unlike the partitioned parallelism in
>>>> >>> parallel database systems. Apache's Hadoop MapReduce platform is the
>>>> >>> most prominent implementation of this paradigm for the rest of the
>>>> >>> Big Data community. On top of Hadoop and HDFS sit declarative
>>>> >>> languages like Pig and Hive that each compile down to Hadoop
>>>> >>> MapReduce jobs.
>>>> >>>
>>>> >>> The big Web companies were also challenged by extreme user bases
>>>> >>> (100s of millions of users) and needed fast simple lookups and
>>>> >>> updates to very large keyed data sets like user profiles. SQL
>>>> >>> databases were deemed either too expensive or not scalable, so the
>>>> >>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
>>>> >>> popular key-value stores, in this space. MongoDB and Couchbase are
>>>> >>> other open source alternatives (document stores).
>>>> >>>
>>>> >>> It is evident from the rapidly growing popularity of "NoSQL" stores,
>>>> >>> as well as the strong demand for Big Data analytics engines today,
>>>> >>> that there is a strong (and growing!) need to store, process, *and*
>>>> >>> query large volumes of semi-structured data in many application
>>>> >>> areas. Until very recently, developers have had to "choose" between
>>>> >>> using big data analytics engines like Apache Hive or Apache Spark,
>>>> >>> which can do complex query processing and analysis over HDFS-resident
>>>> >>> files, and flexible but low-function data stores like MongoDB or
>>>> >>> Apache HBase. (The Apache Phoenix project,
>>>> >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
>>>> >>> aims to bridge between these choices.)
>>>> >>>
>>>> >>> AsterixDB is a highly scalable data management system that can store,
>>>> >>> index, and manage semi-structured data, e.g., much like MongoDB, but
>>>> >>> it also supports a full-power query language with the expressiveness
>>>> >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
>>>> >>> stores and manages data, so AsterixDB can exploit its knowledge of
>>>> >>> data partitioning and the availability of indexes to avoid always
>>>> >>> scanning data set(s) to process queries. Somewhat surprisingly, there
>>>> >>> is no open source parallel database system (relational or otherwise)
>>>> >>> available to developers today -- AsterixDB aims to fill this need.
>>>> >>> Since Apache is where the majority of today's most important Big
>>>> >>> Data technologies live, the ASF seems like the obvious home for a
>>>> >>> system like AsterixDB.
>>>> >>>
>>>> >>> Current Status
>>>> >>>
>>>> >>> The current version of AsterixDB was co-developed by a team of
>>>> >>> faculty, staff, and students at UC Irvine and UC Riverside. The
>>>> >>> project was initiated as a large NSF-sponsored project in 2009, the
>>>> >>> goal of which was to combine the best ideas from the parallel
>>>> >>> database world, the then new Hadoop world, and the semi-structured
>>>> >>> (e.g., XML/JSON) data world in order to create a next-generation
>>>> >>> BDMS. A first informal open source release was made four years later,
>>>> >>> in June of 2013, under the Apache Software License 2.0.
>>>> >>>
>>>> >>>
>>>> >>> Meritocracy
>>>> >>>
>>>> >>> The current developers are familiar with meritocratic open source
>>>> >>> development at Apache. Apache was chosen specifically because we want
>>>> >>> to encourage this style of development for the project.
>>>> >>>
>>>> >>>
>>>> >>> Community
>>>> >>>
>>>> >>> While AsterixDB started as a university project it has developed into
>>>> >>> a community. A number of the initial committers started contributing
>>>> >>> in academia and continue to actively participate and contribute after
>>>> >>> graduation. And we seek to further develop developer and user
>>>> >>> communities. One way to broaden the community that is ongoing is
>>>> >>> through academic collaborations (currently with IIT Mumbai in India
>>>> >>> and TU Berlin in Germany). During incubation we will also explicitly
>>>> >>> seek increased industrial participation.
>>>> >>>
>>>> >>> Some indicators of the effort's development community and history can
>>>> >>> be found at:
>>>> >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
>>>> >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>>>> >>>
>>>> >>>
>>>> >>> Core Developers
>>>> >>>
>>>> >>> The core developers of the project are diverse, although initially UC
>>>> >>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
>>>> >>> other 50% are from other academic institutions (UC Riverside and the
>>>> >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
>>>> >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>>>> >>>
>>>> >>>
>>>> >>> Alignment
>>>> >>>
>>>> >>> Apache is, by far, the most natural home for taking the AsterixDB
>>>> >>> project forward. A large fraction of today's top Big Data
>>>> >>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
>>>> >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
>>>> >>> significant gap -- the parallel data management system gap -- that
>>>> >>> exists in the Big Data open source world. It is well-aligned with a
>>>> >>> number of the Apache projects, e.g., it has strong support for
>>>> >>> accessing and indexing external data in HDFS, and it uses YARN as an
>>>> >>> answer to basic cluster resource management. AsterixDB also seeks to
>>>> >>> achieve an Apache-style development model; it is seeking a broader
>>>> >>> community of contributors and users in order to achieve its full
>>>> >>> potential and value to the Big Data community.
>>>> >>>
>>>> >>> There are also a number of related Apache projects and dependencies
>>>> >>> that will be mentioned below in the Relationships with Other Apache
>>>> >>> Products section.
>>>> >>>
>>>> >>>
>>>> >>> Known Risks
>>>> >>>
>>>> >>> Orphaned products
>>>> >>>
>>>> >>> Given the current level of intellectual investment in AsterixDB, the
>>>> >>> risk of the project being abandoned is very small. The UCI/UCR
>>>> >>> faculty team leads are highly incentivized to continue development
>>>> >>> since the database groups at UC Irvine and UC Riverside are both
>>>> >>> reliant on AsterixDB as a platform for long-term graduate research
>>>> >>> projects. UC San Diego is also beginning to contribute to the code
>>>> >>> base, and a collaboration involving public health applications is
>>>> >>> forming with UCLA. The work on AsterixDB is managed via a mix of
>>>> >>> mailing list discussions supplemented by weekly project status
>>>> >>> meetings which are summarized on the mailing list. Typical (local
>>>> >>> plus Skype-in) attendance to the weekly status meetings runs at about
>>>> >>> 20 active contributors.
>>>> >>>
>>>> >>>
>>>> >>> Inexperience with Open Source
>>>> >>>
>>>> >>> AsterixDB and Hyracks were completely developed in Open Source under
>>>> >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
>>>> >>> lists are available on Google Code and discussions and decisions
>>>> >>> happen on the mailing lists (which is necessary due to the geographic
>>>> >>> distribution of the current developers).
>>>> >>>
>>>> >>> Also a few of the initial committers have contributed to Apache
>>>> >>> projects. Vinayak Borkar is a committer on the Apache Helix and
>>>> >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
>>>> >>> and an IPMC member. Preston Carman and Steven Jacobs are committers
>>>> >>> on the Apache VXQuery project.
>>>> >>>
>>>> >>>
>>>> >>> Relationships with Other Apache Products
>>>> >>>
>>>> >>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
>>>> >>> is also included in the AsterixDB code base.
>>>> >>>
>>>> >>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
>>>> >>> is support for accessing external data in HDFS (and Hive formats),
>>>> >>> and resource management and system administration features are in the
>>>> >>> process of being migrated to YARN.
>>>> >>>
>>>> >>> AsterixDB's AQL query facilities offer comparable query power to
>>>> >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
>>>> >>> differs in storing and indexing data and thus being able to quickly
>>>> >>> answer small and medium queries without large HDFS data scans -
>>>> >>> thereby targeting a different class of use cases.
>>>> >>>
>>>> >>> AsterixDB's data storage and indexing facilities are similar to those
>>>> >>> of HBase, but AsterixDB differs in being a much more complete and
>>>> >>> queryable BDMS (not just a key-value style store).
>>>> >>>
>>>> >>> AsterixDB's target use cases are not in-memory processing or
>>>> >>> iterative algorithm support, making AsterixDB complementary to the
>>>> >>> Apache Spark platform. (Spark interoperability is on our longer-term
>>>> >>> to-do wishlist.)
>>>> >>>
>>>> >>>
>>>> >>> Homogeneous Developers
>>>> >>>
>>>> >>> As mentioned before the current community is already organizationally
>>>> >>> and geographically distributed - and we would like to increase the
>>>> >>> heterogeneity.
>>>> >>>
>>>> >>>
>>>> >>> Reliance on Salaried Developers
>>>> >>>
>>>> >>> Of the initial committers only 3 are full-time UCI staff. The other
>>>> >>> committers are a mix of students, alumni who continue to contribute
>>>> >>> to the effort, and individuals working with permission part-time (or
>>>> >>> in spare time) on this project.
>>>> >>>
>>>> >>>
>>>> >>> An Excessive Fascination with the Apache Brand
>>>> >>>
>>>> >>> We believe in the processes, systems, and framework Apache has put in
>>>> >>> place. Apache is also known to foster a great community around their
>>>> >>> projects and provide exposure. While brand is important, our
>>>> >>> fascination with it is not excessive. We believe that the ASF is the
>>>> >>> right home for AsterixDB and that having AsterixDB inside of the ASF
>>>> >>> will lead to a better long-term outcome for the Big Data community.
>>>> >>>
>>>> >>>
>>>> >>> Documentation
>>>> >>>
>>>> >>> Documentation and publications related to AsterixDB can be found at
>>>> >>> http://asterixdb.ics.uci.edu/.
>>>> >>>
>>>> >>>
>>>> >>> Initial Source
>>>> >>>
>>>> >>> Current source resides in Google Code:
>>>> >>> https://code.google.com/p/asterixdb/ (query language and upper system
>>>> >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
>>>> >>> system and storage management libraries).
>>>> >>>
>>>> >>>
>>>> >>> External Dependencies
>>>> >>>
>>>> >>> AsterixDB depends on a number of Apache projects:
>>>> >>>
>>>> >>> - Ant
>>>> >>> - Avro
>>>> >>> - ApacheDB JDO
>>>> >>> - Commons
>>>> >>> - Derby
>>>> >>> - Hadoop
>>>> >>> - Hive
>>>> >>> - HTTPComponents
>>>> >>> - Jakarta ORO
>>>> >>> - Maven
>>>> >>> - Tomcat
>>>> >>> - Thrift
>>>> >>> - Velocity
>>>> >>> - Wicket
>>>> >>> - Xerces
>>>> >>>
>>>> >>> and other open source projects (organized by license):
>>>> >>>
>>>> >>> -- ASL 2.0:
>>>> >>> - Jackson
>>>> >>> - Google Guava
>>>> >>> - Google Guice
>>>> >>> - JSON-simple
>>>> >>> - BoneCP
>>>> >>> - Microsoft Azure SDK
>>>> >>> - Netty
>>>> >>> - Rome
>>>> >>> - JetS3t
>>>> >>> - Groovy
>>>> >>> - Jettison
>>>> >>> - Plexus
>>>> >>> - Datanucleus (JDO)
>>>> >>> - Jetty
>>>> >>> - Twitter4J
>>>> >>> - Snappy-java
>>>> >>>
>>>> >>> -- BSD:
>>>> >>> - Antlr
>>>> >>> - ObjectWeb ASM
>>>> >>> - Protobuf
>>>> >>> - JSCH
>>>> >>> - JavaCC
>>>> >>> - Paranamer
>>>> >>> - JLine
>>>> >>> - Stax
>>>> >>> - StringTemplate
>>>> >>> - xmlEnc
>>>> >>>
>>>> >>> -- MIT
>>>> >>> - AppAssembler
>>>> >>> - SimpleLog4J
>>>> >>>
>>>> >>> -- CDDL 1.0
>>>> >>> - Java Activation Framework
>>>> >>> - Java Transactions
>>>> >>> - Java Servlet API
>>>> >>> - Grizzly
>>>> >>> - gmbal
>>>> >>> - Glassfish
>>>> >>>
>>>> >>> -- CDDL 1.1
>>>> >>> - Jersey
>>>> >>> - JAXB Reference Implementation
>>>> >>>
>>>> >>> -- JSON License
>>>> >>> - JSON
>>>> >>>
>>>> >>> -- EPL 1.0
>>>> >>> - JUnit
>>>> >>>
>>>> >>> -- JDOM License
>>>> >>> - JDOM
>>>> >>>
>>>> >>> -- Public Domain
>>>> >>> - xz
>>>> >>> - AOPAlliance
>>>> >>>
>>>> >>> As all dependencies are managed using Apache Maven, none of the
>>>> >>> external libraries need to be packaged in a source distribution.
>>>> >>>
>>>> >>>
>>>> >>> Required Resources
>>>> >>>
>>>> >>> Developer and user mailing lists
>>>> >>>
>>>> >>> [email protected] (with moderated subscriptions)
>>>> >>> [email protected]
>>>> >>> [email protected]
>>>> >>> [email protected]
>>>> >>>
>>>> >>>
>>>> >>> A git repository
>>>> >>>
>>>> >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>>>> >>>
>>>> >>>
>>>> >>> A JIRA issue tracker
>>>> >>>
>>>> >>> https://issues.apache.org/jira/browse/ASTERIXDB
>>>> >>>
>>>> >>>
>>>> >>> Initial Committers
>>>> >>>
>>>> >>> The following is a list of the planned initial Apache committers (the
>>>> >>> active subset of the committers for the current repository at Google
>>>> >>> Code).
>>>> >>>
>>>> >>> Abdullah Alamoudi ([email protected])
>>>> >>> Cameron Samak ([email protected])
>>>> >>> Chen Li ([email protected])
>>>> >>> Ian Maxon ([email protected])
>>>> >>> Ildar Absalyamov ([email protected])
>>>> >>> Jianfeng Jia ([email protected])
>>>> >>> Keren Ouaknine ([email protected])
>>>> >>> Markus Dreseler ([email protected])
>>>> >>> Mike Carey ([email protected])
>>>> >>> Murtadha Hubail ([email protected])
>>>> >>> Pouria Pirzadeh ([email protected])
>>>> >>> Preston Carman ([email protected])
>>>> >>> Raman Grover ([email protected])
>>>> >>> Sattam Alsubaiee ([email protected])
>>>> >>> Steven Jacobs ([email protected])
>>>> >>> Taewoo Kim ([email protected])
>>>> >>> Till Westmann ([email protected])
>>>> >>> Vinayak Borkar ([email protected])
>>>> >>> Yingyi Bu ([email protected])
>>>> >>> Young-Seok Kim ([email protected])
>>>> >>> Zach Heilbron ([email protected])
>>>> >>>
>>>> >>>
>>>> >>> Affiliations
>>>> >>>
>>>> >>> UC Irvine
>>>> >>> - Mike Carey
>>>> >>> - Chen Li
>>>> >>> - Ian Maxon
>>>> >>> - Yingyi Bu
>>>> >>> - Raman Grover
>>>> >>> - Pouria Pirzadeh
>>>> >>> - Young-Seok Kim
>>>> >>> - Cameron Samak
>>>> >>> - Taewoo Kim
>>>> >>> - Jianfeng Jia
>>>> >>> - Murtadha Hubail
>>>> >>> - Markus Dreseler
>>>> >>>
>>>> >>> UC Riverside
>>>> >>> - Ildar Absalyamov
>>>> >>> - Preston Carman
>>>> >>> - Steven Jacobs
>>>> >>>
>>>> >>> Hebrew University
>>>> >>> - Keren Ouaknine
>>>> >>>
>>>> >>> Oracle
>>>> >>> - Till Westmann
>>>> >>>
>>>> >>> X15 Software
>>>> >>> - Vinayak Borkar
>>>> >>> - Zach Heilbron
>>>> >>>
>>>> >>> KACST Saudi Arabia
>>>> >>> - Sattam Alsubaiee
>>>> >>>
>>>> >>> Saudi Aramco
>>>> >>> - Abdullah Alamoudi
>>>> >>>
>>>> >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
>>>> >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
>>>> >>> non-UC committers are a mix of alumni who continue to contribute to
>>>> >>> the effort and individuals working with permission part-time (or in
>>>> >>> spare time) on this project.
>>>> >>>
>>>> >>>
>>>> >>> Sponsors
>>>> >>>
>>>> >>> Champion
>>>> >>>
>>>> >>> Chris Mattmann (NASA/JPL)
>>>> >>>
>>>> >>> Nominated Mentors
>>>> >>>
>>>> >>> TBD
>>>> >>>
>>>> >>> Sponsoring Entity
>>>> >>>
>>>> >>> The Apache Incubator
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> Chris Mattmann, Ph.D.
>>>> >>> Chief Architect
>>>> >>> Instrument Software and Science Data Systems Section (398)
>>>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> >>> Office: 168-519, Mailstop: 168-527
>>>> >>> Email: [email protected]
>>>> >>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>> Adjunct Associate Professor, Computer Science Department
>>>> >>> University of Southern California, Los Angeles, CA 90089 USA
>>>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >
>>>> 
>
