Re: [PROPOSAL] Apache AsterixDB Incubator

Ted Dunning Mon, 19 Jan 2015 23:32:05 -0800

Chris just asked me under separate cover.

I am happy to help out as mentor.




On Mon, Jan 19, 2015 at 8:17 PM, Henry Saputra <henry.sapu...@gmail.com>
wrote:

> Thanks Till,
>
> Will try to solicit more mentors to help.
> Especially since most of the initial committers have not been exposed to
> contributing the Apache way.
>
> - Henry
>
> On Mon, Jan 19, 2015 at 5:28 PM, Till Westmann <t...@westmann.org> wrote:
> > Hi Henry,
> >
> > thanks! It’s great that you’ve seen (and liked) AsterixDB before.
> >
> > Even if your time is very limited we would be very happy to have you on
> > board as a mentor.
> > I’ll add you to the proposal.
> >
> > Cheers,
> > Till
> >
> >> On Jan 19, 2015, at 10:26 AM, Henry Saputra <henry.sapu...@gmail.com>
> >> wrote:
> >>
> >> +1 This is GREAT News!
> >>
> >> Was watching and trying AsterixDB last year and it looked to be in awesome shape.
> >>
> >> I have my plate full but would love to help mentor this project to get
> >> it going to ASF if needed!
> >>
> >> - Henry
> >>
> >> On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
> >> <chris.a.mattm...@jpl.nasa.gov> wrote:
> >>> Hi Folks,
> >>>
> >>> I am pleased to bring forth the Apache AsterixDB proposal to the
> >>> Apache Incubator as Champion, working in collaboration with the
> >>> team. Please find the wiki proposal here:
> >>>
> >>> https://wiki.apache.org/incubator/AsterixDBProposal
> >>>
> >>>
> >>> Full text of the proposal is below. Please discuss and enjoy. I’ll
> >>> leave the discussion open for a week, and then look to call a VOTE
> >>> hopefully end of next week if all is well.
> >>>
> >>> Cheers!
> >>> Chris Mattmann
> >>>
> >>> =============================================================
> >>> Apache AsterixDB Proposal
> >>>
> >>> Abstract
> >>>
> >>> Apache AsterixDB is a scalable big data management system (BDMS) that
> >>> provides storage, management, and query capabilities for large
> >>> collections of semi-structured data.
> >>>
> >>> Proposal
> >>>
> >>> AsterixDB is a big data management system (BDMS) whose feature set
> >>> makes it well-suited to needs such as web data warehousing and social
> >>> data storage and analysis. Feature-wise, AsterixDB has:
> >>>
> >>> * A NoSQL style data model (ADM) based on extending JSON with object
> >>>  database concepts.
> >>> * An expressive and declarative query language (AQL) for querying
> >>>  semi-structured data.
> >>> * A runtime query execution engine, Hyracks, for partitioned-parallel
> >>>  execution of query plans.
> >>> * Partitioned LSM-based data storage and indexing for efficient
> >>>  ingestion of newly arriving data.
> >>> * Support for querying and indexing external data (e.g., in HDFS) as
> >>>  well as data stored within AsterixDB.
> >>> * A rich set of primitive data types, including support for spatial,
> >>>  temporal, and textual data.
> >>> * Indexing options that include B+ trees, R trees, and inverted
> >>>  keyword index support.
> >>> * Basic transactional (concurrency and recovery) capabilities akin to
> >>>  those of a NoSQL store.
> >>>
> >>>
> >>> Background and Rationale
> >>>
> >>> In the world of relational databases, the need to tackle data volumes
> >>> that exceed the capabilities of a single server led to the
> >>> development of “shared-nothing” parallel database systems several
> >>> decades ago. These systems spread data over a cluster based on a
> >>> partitioning strategy, such as hash partitioning, and queries are
> >>> processed by employing partitioned-parallel divide-and-conquer
> >>> techniques. Since these systems are fronted by a high-level,
> >>> declarative language (SQL), their users are shielded from the
> >>> complexities of parallel programming. Parallel database systems have
> >>> been an extremely successful application of parallel computing, and
> >>> quite a number of commercial products exist today.
> >>>
> >>> In the distributed systems world, the Web brought a need to index and
> >>> query its huge content. SQL and relational databases were not the
> >>> answer, though shared-nothing clusters again emerged as the hardware
> >>> platform of choice. Google developed the Google File System (GFS) and
> >>> MapReduce programming model to allow programmers to store and process
> >>> Big Data by writing a few user-defined functions. The MapReduce
> >>> framework applies these functions in parallel to data instances in
> >>> distributed files (map) and to sorted groups of instances sharing a
> >>> common key (reduce) -- not unlike the partitioned parallelism in
> >>> parallel database systems. Apache's Hadoop MapReduce platform is the
> >>> most prominent implementation of this paradigm for the rest of the
> >>> Big Data community. On top of Hadoop and HDFS sit declarative
> >>> languages like Pig and Hive that each compile down to Hadoop
> >>> MapReduce jobs.
> >>>
> >>> The big Web companies were also challenged by extreme user bases
> >>> (100s of millions of users) and needed fast simple lookups and
> >>> updates to very large keyed data sets like user profiles. SQL
> >>> databases were deemed either too expensive or not scalable, so the
> >>> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> >>> popular key-value stores, in this space. MongoDB and Couchbase are
> >>> other open source alternatives (document stores).
> >>>
> >>> It is evident from the rapidly growing popularity of "NoSQL" stores,
> >>> as well as the strong demand for Big Data analytics engines today,
> >>> that there is a strong (and growing!) need to store, process, *and*
> >>> query large volumes of semi-structured data in many application
> >>> areas. Until very recently, developers have had to "choose" between
> >>> using big data analytics engines like Apache Hive or Apache Spark,
> >>> which can do complex query processing and analysis over HDFS-resident
> >>> files, and flexible but low-function data stores like MongoDB or
> >>> Apache HBase. (The Apache Phoenix project,
> >>> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> >>> aims to bridge between these choices.)
> >>>
> >>> AsterixDB is a highly scalable data management system that can store,
> >>> index, and manage semi-structured data, e.g., much like MongoDB, but
> >>> it also supports a full-power query language with the expressiveness
> >>> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> >>> stores and manages data, so AsterixDB can exploit its knowledge of
> >>> data partitioning and the availability of indexes to avoid always
> >>> scanning data set(s) to process queries. Somewhat surprisingly, there
> >>> is no open source parallel database system (relational or otherwise)
> >>> available to developers today -- AsterixDB aims to fill this need.
> >>> Since Apache is where the majority of today's most important Big
> >>> Data technologies live, the ASF seems like the obvious home for a
> >>> system like AsterixDB.
> >>>
> >>> Current Status
> >>>
> >>> The current version of AsterixDB was co-developed by a team of
> >>> faculty, staff, and students at UC Irvine and UC Riverside. The
> >>> project was initiated as a large NSF-sponsored project in 2009, the
> >>> goal of which was to combine the best ideas from the parallel
> >>> database world, the then new Hadoop world, and the semi-structured
> >>> (e.g., XML/JSON) data world in order to create a next-generation
> >>> BDMS. A first informal open source release was made four years later,
> >>> in June of 2013, under the Apache Software License 2.0.
> >>>
> >>>
> >>> Meritocracy
> >>>
> >>> The current developers are familiar with meritocratic open source
> >>> development at Apache. Apache was chosen specifically because we want
> >>> to encourage this style of development for the project.
> >>>
> >>>
> >>> Community
> >>>
> >>> While AsterixDB started as a university project it has developed into
> >>> a community. A number of the initial committers started contributing
> >>> in academia and continue to actively participate and contribute after
> >>> graduation. We seek to further develop both the developer and user
> >>> communities. One ongoing way of broadening the community is
> >>> through academic collaborations (currently with IIT Mumbai in India
> >>> and TU Berlin in Germany). During incubation we will also explicitly
> >>> seek increased industrial participation.
> >>>
> >>> Some indicators of the effort's development community and history can
> >>> be found at:
> >>>
> >>> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
> >>> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
> >>>
> >>>
> >>> Core Developers
> >>>
> >>> The core developers of the project are diverse, although initially UC
> >>> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> >>> other 50% are from other academic institutions (UC Riverside and the
> >>> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> >>> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
> >>>
> >>>
> >>> Alignment
> >>>
> >>> Apache is, by far, the most natural home for taking the AsterixDB
> >>> project forward. A large fraction of today's top Big Data
> >>> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> >>> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> >>> significant gap -- the parallel data management system gap -- that
> >>> exists in the Big Data open source world. It is well-aligned with a
> >>> number of the Apache projects, e.g., it has strong support for
> >>> accessing and indexing external data in HDFS, and it uses YARN as an
> >>> answer to basic cluster resource management. AsterixDB also seeks to
> >>> achieve an Apache-style development model; it is seeking a broader
> >>> community of contributors and users in order to achieve its full
> >>> potential and value to the Big Data community.
> >>>
> >>> There are also a number of related Apache projects and dependencies
> >>> that will be mentioned below in the Relationships with Other Apache
> >>> products section.
> >>>
> >>>
> >>> Known Risks
> >>>
> >>> Orphaned products
> >>>
> >>> Given the current level of intellectual investment in AsterixDB, the
> >>> risk of the project being abandoned is very small. The UCI/UCR
> >>> faculty team leads are highly incentivized to continue development
> >>> since the database groups at UC Irvine and UC Riverside are both
> >>> reliant on AsterixDB as a platform for long-term graduate research
> >>> projects. UC San Diego is also beginning to contribute to the code
> >>> base, and a collaboration involving public health applications is
> >>> forming with UCLA. The work on AsterixDB is managed via a mix of
> >>> mailing list discussions supplemented by weekly project status
> >>> meetings which are summarized on the mailing list. Typical (local
> >>> plus Skype-in) attendance at the weekly status meetings runs at about
> >>> 20 active contributors.
> >>>
> >>>
> >>> Inexperience with Open Source
> >>>
> >>> AsterixDB and Hyracks were completely developed in Open Source under
> >>> the ASL 2.0. The source code repositories, issue tracker, and mailing
> >>> lists are available on Google Code and discussions and decisions
> >>> happen on the mailing lists (which is necessary due to the geographic
> >>> distribution of the current developers).
> >>>
> >>> Also a few of the initial committers have contributed to Apache
> >>> projects. Vinayak Borkar is a committer on the Apache Helix and
> >>> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> >>> and an IPMC member. Preston Carman and Steven Jacobs are committers
> >>> on the Apache VXQuery project.
> >>>
> >>>
> >>> Relationships with Other Apache Products
> >>>
> >>> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> >>> is also included in the AsterixDB code base.
> >>>
> >>> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> >>> is support for accessing external data in HDFS (and Hive formats),
> >>> and resource management and system administration features are in the
> >>> process of being migrated to YARN.
> >>>
> >>> AsterixDB's AQL query facilities offer comparable query power to
> >>> Apache's Pig and Hive systems for big data analytics. AsterixDB
> >>> differs in storing and indexing data and thus being able to quickly
> >>> answer small and medium queries without large HDFS data scans -
> >>> thereby targeting a different class of use cases.
> >>>
> >>> AsterixDB's data storage and indexing facilities are similar to those
> >>> of HBase, but AsterixDB differs in being a much more complete and
> >>> queryable BDMS (not just a key-value style store).
> >>>
> >>> AsterixDB's target use cases are not in-memory processing or
> >>> iterative algorithm support, making AsterixDB complementary to the
> >>> Apache Spark platform. (Spark interoperability is on our longer-term
> >>> to-do wishlist.)
> >>>
> >>>
> >>> Homogeneous Developers
> >>>
> >>> As mentioned before the current community is already organizationally
> >>> and geographically distributed - and we would like to increase the
> >>> heterogeneity.
> >>>
> >>>
> >>> Reliance on Salaried Developers
> >>>
> >>> Of the initial committers only 3 are full-time UCI staff. The other
> >>> committers are a mix of students, alumni who continue to contribute
> >>> to the effort, and individuals working with permission part-time (or
> >>> in spare time) on this project.
> >>>
> >>>
> >>> An Excessive Fascination with the Apache Brand
> >>>
> >>> We believe in the processes, systems, and framework Apache has put in
> >>> place. Apache is also known to foster a great community around their
> >>> projects and provide exposure. While brand is important, our
> >>> fascination with it is not excessive. We believe that the ASF is the
> >>> right home for AsterixDB and that having AsterixDB inside of the ASF
> >>> will lead to a better long-term outcome for the Big Data community.
> >>>
> >>>
> >>> Documentation
> >>>
> >>> Documentation and publications related to AsterixDB can be found at
> >>> http://asterixdb.ics.uci.edu/.
> >>>
> >>>
> >>> Initial Source
> >>>
> >>> Current source resides in Google Code:
> >>> https://code.google.com/p/asterixdb/ (query language and upper system
> >>> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> >>> system and storage management libraries).
> >>>
> >>>
> >>> External Dependencies
> >>>
> >>> AsterixDB depends on a number of Apache projects:
> >>>
> >>> - Ant
> >>> - Avro
> >>> - ApacheDB JDO
> >>> - Commons
> >>> - Derby
> >>> - Hadoop
> >>> - Hive
> >>> - HTTPComponents
> >>> - Jakarta ORO
> >>> - Maven
> >>> - Tomcat
> >>> - Thrift
> >>> - Velocity
> >>> - Wicket
> >>> - Xerces
> >>>
> >>> and other open source projects (organized by license):
> >>>
> >>> -- ASL 2.0:
> >>> - Jackson
> >>> - Google Guava
> >>> - Google Guice
> >>> - JSON-simple
> >>> - BoneCP
> >>> - Microsoft Azure SDK
> >>> - Netty
> >>> - Rome
> >>> - JetS3t
> >>> - Groovy
> >>> - Jettison
> >>> - Plexus
> >>> - Datanucleus (JDO)
> >>> - Jetty
> >>> - Twitter4J
> >>> - Snappy-java
> >>>
> >>> -- BSD:
> >>> - Antlr
> >>> - ObjectWeb ASM
> >>> - Protobuf
> >>> - JSCH
> >>> - JavaCC
> >>> - Paranamer
> >>> - JLine
> >>> - Stax
> >>> - StringTemplate
> >>> - xmlEnc
> >>>
> >>> -- MIT
> >>> - AppAssembler
> >>> - SimpleLog4J
> >>>
> >>> -- CDDL 1.0
> >>> - Java Activation Framework
> >>> - Java Transactions
> >>> - Java Servlet API
> >>> - Grizzly
> >>> - gmbal
> >>> - Glassfish
> >>>
> >>> -- CDDL 1.1
> >>> - Jersey
> >>> - JAXB Reference Implementation
> >>>
> >>> -- JSON License
> >>> - JSON
> >>>
> >>> -- EPL 1.0
> >>> - JUnit
> >>>
> >>> -- JDOM License
> >>> - JDOM
> >>>
> >>> -- Public Domain
> >>> - xz
> >>> - AOPAlliance
> >>>
> >>> As all dependencies are managed using Apache Maven, none of the
> >>> external libraries need to be packaged in a source distribution.
> >>>
> >>>
> >>> Required Resources
> >>>
> >>> Developer and user mailing lists
> >>>
> >>> priv...@asterixdb.incubator.apache.org (with moderated subscriptions)
> >>> comm...@asterixdb.incubator.apache.org
> >>> d...@asterixdb.incubator.apache.org
> >>> us...@asterixdb.incubator.apache.org
> >>>
> >>>
> >>> A git repository
> >>>
> >>> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
> >>>
> >>>
> >>> A JIRA issue tracker
> >>>
> >>> https://issues.apache.org/jira/browse/ASTERIXDB
> >>>
> >>>
> >>> Initial Committers
> >>>
> >>> The following is a list of the planned initial Apache committers (the
> >>> active subset of the committers for the current repository at Google
> >>> Code).
> >>>
> >>> Abdullah Alamoudi (bamou...@gmail.com)
> >>> Cameron Samak (euf...@gmail.com)
> >>> Chen Li (che...@gmail.com)
> >>> Ian Maxon (ima...@uci.edu)
> >>> Ildar Absalyamov (ildar.absalya...@gmail.com)
> >>> Jianfeng Jia (jianfeng....@gmail.com)
> >>> Keren Ouaknine (ker...@gmail.com)
> >>> Markus Dreseler (apa...@dreseler.de)
> >>> Mike Carey (dtab...@apache.org)
> >>> Murtadha Hubail (hubail...@gmail.com)
> >>> Pouria Pirzadeh (pouria.pirza...@gmail.com)
> >>> Preston Carman (prest...@apache.org)
> >>> Raman Grover (ramangrove...@gmail.com)
> >>> Sattam Alsubaiee (salsuba...@gmail.com)
> >>> Steven Jacobs (sjaco...@apache.org)
> >>> Taewoo Kim (wangs...@gmail.com)
> >>> Till Westmann (ti...@apache.org)
> >>> Vinayak Borkar (vinay...@apache.org)
> >>> Yingyi Bu (buyin...@gmail.com)
> >>> Young-Seok Kim (kiss...@gmail.com)
> >>> Zach Heilbron (zheilb...@gmail.com)
> >>>
> >>>
> >>> Affiliations
> >>>
> >>> UC Irvine
> >>> - Mike Carey
> >>> - Chen Li
> >>> - Ian Maxon
> >>> - Yingyi Bu
> >>> - Raman Grover
> >>> - Pouria Pirzadeh
> >>> - Young-Seok Kim
> >>> - Cameron Samak
> >>> - Taewoo Kim
> >>> - Jianfeng Jia
> >>> - Murtadha Hubail
> >>> - Markus Dreseler
> >>>
> >>> UC Riverside
> >>> - Ildar Absalyamov
> >>> - Preston Carman
> >>> - Steven Jacobs
> >>>
> >>> Hebrew University
> >>> - Keren Ouaknine
> >>>
> >>> Oracle
> >>> - Till Westmann
> >>>
> >>> X15 Software
> >>> - Vinayak Borkar
> >>> - Zach Heilbron
> >>>
> >>> KACST Saudi Arabia
> >>> - Sattam Alsubaiee
> >>>
> >>> Saudi Aramco
> >>> - Abdullah Alamoudi
> >>>
> >>> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> >>> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> >>> non-UC committers are a mix of alumni who continue to contribute to
> >>> the effort and individuals working with permission part-time (or in
> >>> spare time) on this project.
> >>>
> >>>
> >>> Sponsors
> >>>
> >>> Champion
> >>>
> >>> Chris Mattmann (NASA/JPL)
> >>>
> >>> Nominated Mentors
> >>>
> >>> TBD
> >>>
> >>> Sponsoring Entity
> >>>
> >>> The Apache Incubator
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Chris Mattmann, Ph.D.
> >>> Chief Architect
> >>> Instrument Software and Science Data Systems Section (398)
> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >>> Office: 168-519, Mailstop: 168-527
> >>> Email: chris.a.mattm...@nasa.gov
> >>> WWW:  http://sunset.usc.edu/~mattmann/
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> Adjunct Associate Professor, Computer Science Department
> >>> University of Southern California, Los Angeles, CA 90089 USA
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>
> >>>
> >>>
> >>>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>
